How-To Tutorials - Data

1210 Articles

Designing User Security for Oracle PeopleSoft Applications

Packt
04 Jul 2011
12 min read
Understanding user security

Before we get into discussing PeopleSoft security, let's spend some time setting the context for user security. Whenever we think of a complex system like PeopleSoft Financials with potentially hundreds of users, the following are but a few of the questions that face us:

- Should a user working in the billing group have access to transactions, such as vouchers and payments, in Accounts Payable?
- Should a user who is part of the North America business unit have access to the data belonging to the Europe business unit?
- Should a user whose job involves entering vouchers be able to approve and pay them as well?
- Should a data entry clerk be able to view departmental budgets for the organization?

These questions barely scratch the surface of an organization's security considerations. Of course, there is no right or wrong answer to such questions, as every organization has its own security policies. What matters is that we need a mechanism to segregate access to system features and to enforce appropriate controls so that users can access only the features they need.

Implementation challenge

Global Vehicles' Accounts Payable department has three types of users: Mary, the AP Manager; Amy, the AP Supervisor; and Anton, the AP Clerk. These users need the following access to PeopleSoft features:

| User type | Access to feature | Description |
| --- | --- | --- |
| AP Clerk | Voucher entry | A clerk should be able to access system pages to enter vouchers from various vendors. |
| AP Supervisor | Voucher entry, Voucher approval, Voucher posting | A supervisor also needs access to enter vouchers. He/she should review and approve each voucher entered by the clerk. The supervisor should also be able to execute the Voucher Post process that creates accounting entries for vouchers. |
| AP Manager | Pay cycle, Voucher approval | The AP Manager should be the only one who can execute the Pay Cycle (a process that prints checks to issue payments to vendors). The manager (in addition to the supervisor) should also have the authority to approve vouchers. |

Note that this is an extremely simplified scenario that does not include all the features required in Accounts Payable.

Solution

We will design a security matrix that uses distinct security roles. We'll configure permission lists, roles, and user profiles to limit user access to the required system features. The PeopleSoft security matrix is a three-level structure consisting of permission lists (at the bottom), roles (in the middle), and user profiles (at the top). The following illustration shows how it is structured.

We need to create a user profile for each user of the system. A user profile can have as many roles as needed. For example, a user can have the roles of Billing Supervisor and Accounts Receivable Payment Processor if he/she approves customer invoices as well as processes customer payments. Thus, the number of roles granted to a user depends entirely on his/her job responsibilities. Each role can have multiple permission lists. A permission list determines which features can be accessed by a role: we specify which pages can be accessed and the mode in which they can be accessed (read only/add/update). In a nutshell, a role is a grouping of system features that a user needs to access, while a permission list defines the nature of access to those system features.
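PeopleSoft delivers this three-level structure through its online configuration pages rather than through code, but a small illustrative sketch can make the relationships concrete. The class names and fields below are hypothetical and exist only to model the hierarchy just described:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the PeopleSoft security hierarchy:
// a user profile holds roles, and each role holds permission lists.
class PermissionList {
    final String name;              // e.g. "PL_Voucher_Entry"
    PermissionList(String name) { this.name = name; }
}

class Role {
    final String name;              // e.g. "RL_Voucher_Entry"
    final List<PermissionList> permissionLists = new ArrayList<>();
    Role(String name) { this.name = name; }
}

class UserProfile {
    final String userId;            // e.g. "AMY"
    final List<Role> roles = new ArrayList<>();
    UserProfile(String userId) { this.userId = userId; }
}
```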
Expert tip

Deciding how many roles to create, and which ones, requires careful thought. How easy will it be to maintain them in the future? Think of scenarios where a function needs to be removed from one role and added to another: how easy would that be? As a rule of thumb, map system features to roles so that they do not overlap. Similarly, map system pages to permission lists so that they are mutually exclusive. This simplifies user security maintenance to a large extent. Although advisable, this may not always be possible; organizational requirements are sometimes too complicated. However, a PeopleSoft professional should try to build a modular security design to the extent possible.

Now, let's design our security matrix for the hypothetical scenario presented previously and test our rule of thumb of mutually exclusive roles and permission lists. What do you observe as far as the required system features for each role are concerned? You are absolutely right if you noticed that some system features (such as voucher entry) are common across roles. Which roles should we design for this situation? Going by the principle of mutually exclusive roles, we can map system features to permission lists (and, in turn, to roles) without overlapping them. We'll denote roles with the prefix 'RL' and permission lists with the prefix 'PL'. The mapping may look something like this:

| Role | Permission list | System feature |
| --- | --- | --- |
| RL_Voucher_Entry | PL_Voucher_Entry | Voucher Entry |
| RL_Voucher_Approval | PL_Voucher_Approval | Voucher Approval |
| RL_Voucher_Posting | PL_Voucher_Posting | Voucher Posting |
| RL_Pay_Cycle | PL_Pay_Cycle | Pay Cycle |

So, now we have created the required roles and permission lists, and attached the appropriate permission lists to each role. Because the scenario is simple, each role has only a single permission list assigned to it. As the final step, we'll assign the appropriate roles to each user's user profile:

| User | Role | System feature accessed |
| --- | --- | --- |
| Mary | RL_Voucher_Approval | Voucher Approval |
| Mary | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Entry | Voucher Entry |
| Amy | RL_Voucher_Approval | Voucher Approval |
| Amy | RL_Voucher_Posting | Voucher Posting |
| Anton | RL_Voucher_Entry | Voucher Entry |

As you can see, each user now has access to the appropriate system features through the roles, and in turn the permission lists, attached to their user profile. Can you think of the advantage of this approach? Let's say that a few months down the line, it is decided that Mary (AP Manager) should be the only one approving vouchers, while Amy (AP Supervisor) should also have the ability to execute pay cycles and issue payments to vendors. How can we accommodate this change? It's quite simple: we remove the role RL_Voucher_Approval from Amy's user profile and add the role RL_Pay_Cycle to it. The security matrix now looks like this:

| User | Role | System feature accessed |
| --- | --- | --- |
| Mary | RL_Voucher_Approval | Voucher Approval |
| Mary | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Entry | Voucher Entry |
| Amy | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Posting | Voucher Posting |
| Anton | RL_Voucher_Entry | Voucher Entry |

Thus, security maintenance becomes less cumbersome when we design roles and permission lists with non-overlapping system features. Of course, this approach has a downside as well: it results in a large number of roles and permission lists, which increases the initial configuration effort. The solution that we actually design for an organization needs to balance these two objectives.
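Continuing the hypothetical sketch above, the maintenance change described here (Amy giving up voucher approval and taking over the pay cycle) touches only the role assignments on her user profile; the roles and permission lists themselves stay untouched:

```java
// Hypothetical usage of the classes sketched earlier in this article
public class SecurityMatrixExample {
    public static void main(String[] args) {
        UserProfile amy = new UserProfile("AMY");
        Role voucherEntry = new Role("RL_Voucher_Entry");
        Role voucherApproval = new Role("RL_Voucher_Approval");
        Role voucherPosting = new Role("RL_Voucher_Posting");
        Role payCycle = new Role("RL_Pay_Cycle");

        // Original matrix: Amy enters, approves, and posts vouchers
        amy.roles.add(voucherEntry);
        amy.roles.add(voucherApproval);
        amy.roles.add(voucherPosting);

        // Policy change: approval moves to the manager only,
        // and Amy takes over the pay cycle instead
        amy.roles.remove(voucherApproval);
        amy.roles.add(payCycle);
    }
}
```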
Expert tip

Having too many permission lists assigned to a user profile can adversely affect system performance. PeopleSoft recommends 10-20 permission lists per user.

Configuring permission lists

Follow this navigation to configure permission lists: PeopleTools | Security | Permissions & Roles | Permission Lists.

The following screenshot shows the General tab of the Permission List setup page, where we can specify the permission list description. We'll go through some of the important configuration activities for the Voucher Entry permission list discussed previously:

- Can Start Application Server?: Selecting this field enables a user with this permission list to start PeopleSoft application servers (all batch processes are executed on this server). A typical system user does not need this option.
- Allow Password to be Emailed?: Selecting this field enables users to receive forgotten passwords through e-mail. Leave the field unchecked to prevent unencrypted passwords from being sent in e-mails.
- Time-out Minutes (Never Time-out and Specific Time-out): These fields determine the number of minutes of inactivity after which the system automatically logs out a user with this permission list.

The following screenshot shows the Pages tab of the permission list page. This is the most important place where we specify the menus, components, and ultimately the pages that a user can access. In PeopleSoft parlance, a component is a collection of related pages. In the previous screenshot, you are looking at a page; the other related pages used to configure permission lists (General, PeopleTools, Process, and so on) together constitute a component. A menu is a collection of various components and is typically related to a system feature, such as 'Enter Voucher Information', as you can see in the screenshot. Thus, to grant access to a page, we need to enter its component and menu details.

Expert tip

To find out the component and menu for a page, press CTRL+J when the page is open in the internet browser.

- Menu Name: This is where we specify all the menus to which a user needs access. Note that a permission list can grant access to multiple menus, but our simple example includes only one system feature (voucher entry) and, in turn, only one menu. Click the + button to add and select more menus.
- Edit Components hyperlink: Once we select a menu, we need to specify which components under it the permission list should access. Click the Edit Components hyperlink to proceed. The system opens the Component Permissions page, as shown in the following screenshot. This page lists all components under the selected menu; the screenshot shows only part of the component list under the 'Enter Voucher Information' menu. The voucher entry pages for which we need to grant access exist under a component named VCHR_EXPRESS.

Click the Edit Pages hyperlink to grant access to specific pages under a given component. The system opens the Page Permissions page, as shown in the following screenshot, which lists part of the pages under the Voucher Entry component:

- Panel Item Name: This field shows the page name to which access is to be granted.
- Authorized?: Select this checkbox to enable a user to access the page. As you can see, we have authorized this permission list to access six pages in this component.
- Display Only: Selecting this checkbox gives the user read-only access to the page; he/she cannot make any changes to the data on it.
- Add: Selecting this checkbox enables the user to add a new transaction (in this case, new vouchers).
- Update/Display: Selecting this checkbox enables the user to retrieve the current effective dated row. He/she can also add future effective dated rows in addition to modifying them.
- Update/Display All: This option gives the user all the privileges of the Update/Display option; in addition, he/she can retrieve past effective dated rows as well.
- Correction: This option enables the user to perform all possible operations, that is, to retrieve, modify, or add past, current, and future effective dated rows.

Effective dates are usually relevant for master data setups. An effective date drives when a particular value comes into effect. A vendor may be set up with an effective date of 1/1/2001 so that it comes into effect from that date. Now assume that its name is slated to change on 1/1/2012. In that case, we can add a new vendor row with the new name and the appropriate effective date, and the system automatically starts using the new name from 1/1/2012. Note that there can be multiple future effective dated rows, but only one current row.

The next tab, PeopleTools, contains configuration options that are more relevant for technical developers. As we are concentrating on business users of the system, we'll not discuss them.

Click the Process tab. As shown in the following screenshot, this tab is used to configure options for process groups. A process group is a collection of batch processes belonging to a specific internal department or business process. For example, PeopleSoft delivers a process group ARALL that includes all Accounts Receivable batch processes. PeopleSoft delivers various pre-configured process groups; however, we can create our own process groups depending on the organization's requirements. Click the Process Group Permissions hyperlink. The system opens a new page where we can select as many process groups as needed. When a process group is selected for a permission list, it enables users to execute the batch and online processes that are part of it.

The following screenshot shows the Sign-on Times tab. This tab controls the time spans during which a user with this permission list can sign on to the system. We can enforce specific days, or specific time spans within a day, when users can sign on. In the case of our permission list, there are no such limits, and users with this permission list will be able to sign on anytime, on all days of the week.

The next tab on this page is Component Interface. A component interface is a PeopleSoft utility that automates bulk data entry into PeopleSoft pages. We can select component interface values on this tab so that users with this permission list have the rights to use them.

Due to the highly technical nature of the activities involved, we will not discuss the Web Libraries, Web Services, Mass Change, Links, and Audit tabs. Oracle offers exhaustive documentation on PeopleSoft applications.

The next important tab on the permission list page is Query. On this tab, the system shows two hyperlinks: Access Group Permissions and Query Profile. Click the Access Group Permissions hyperlink. The system opens the Permission List Access Groups page, as shown in the next screenshot. This page controls which database tables users can access to create database queries using a PeopleSoft tool called Query Manager:

- Tree Name: A tree is a hierarchical structure of database tables. As shown in the screenshot, the tree QUERY_TREE_AP groups all AP tables.
- Access Group: Each tree has multiple nodes called access groups, which are simply logical groups of tables within a tree. In the screenshot, VOUCHERS is a group of voucher-related tables within the AP table tree. With this configuration, users will be able to create database queries on voucher tables in AP.

Using the Query Profile hyperlink, we can set various options that control how users can create queries, such as whether they can use joins, unions, and so on, how many database rows a query can fetch, and whether the user can copy the query output to Microsoft Excel.

Flink Complex Event Processing

Packt
16 Jan 2017
13 min read
In this article by Tanmay Deshpande, the author of the book Mastering Apache Flink, we will learn about the Table API provided by Apache Flink and how we can use it to process relational data structures. We will start learning more about the libraries provided by Apache Flink and how we can use them for specific use cases. To start with, let's try to understand a library called complex event processing (CEP). CEP is a very interesting but complex topic that has its value in various industries. Wherever a stream of events is expected, people naturally want to perform complex event processing. Let's try to understand what CEP is all about.

What is complex event processing?

CEP is a technique to analyze streams of disparate events occurring with high frequency and low latency. These days, streaming events can be found in various industries, for example:

- In the oil and gas domain, sensor data comes from various drilling tools or from upstream oil pipeline equipment
- In the security domain, activity data, malware information, and usage pattern data come from various end points
- In the wearables domain, data comes from various wrist bands with information about your heart beat rate, your activity, and so on
- In the banking domain, data comes from credit card usage, banking activities, and so on

It is very important to analyze these patterns and get notified in real time about any change from the regular pattern. CEP is able to understand patterns across streams of events, sub-events, and their sequences. CEP helps to identify meaningful patterns and complex relationships among seemingly unrelated events, and sends notifications in real and near-real time to avoid damage:

The preceding diagram shows how the CEP flow works. Even though the flow looks simple, CEP has various abilities, such as:

- The ability to produce results as soon as the input event stream is available
- The ability to provide computations such as aggregation over time and timeouts between two events of interest
- The ability to provide real-time/near-real-time alerts and notifications on detection of complex event patterns
- The ability to connect and correlate heterogeneous sources and analyze patterns in them
- The ability to achieve high-throughput, low-latency processing

There are various solutions available in the market. With big data technology advancements, we have multiple options such as Apache Spark, Apache Samza, and Apache Beam, among others, but none of them has a dedicated library to fit all solutions. Now let us try to understand what we can achieve with Flink's CEP library.

Flink CEP

Apache Flink provides the Flink CEP library, which provides APIs to perform complex event processing. The library consists of the following core components:

- Event stream
- Pattern definition
- Pattern detection
- Alert generation

Flink CEP works on Flink's streaming API called DataStream. A programmer needs to define the pattern to be detected from the stream of events, and then Flink's CEP engine detects the pattern and takes appropriate action, such as generating alerts. In order to get started, we need to add the following Maven dependency:

```xml
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
```

Event stream

A very important component of CEP is its input event stream. We have seen details of the DataStream API.
Now let's use that knowledge to implement CEP. The very first thing we need to do is define a Java POJO for the event. Let's assume we need to monitor a temperature sensor event stream. First we define an abstract class and then extend it. While defining the event POJOs, we need to make sure that we implement the hashCode() and equals() methods, as they are used when events are compared. The following code snippets demonstrate this. First, we write an abstract class as shown here:

```java
package com.demo.chapter05;

public abstract class MonitoringEvent {

    private String machineName;

    public String getMachineName() {
        return machineName;
    }

    public void setMachineName(String machineName) {
        this.machineName = machineName;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((machineName == null) ? 0 : machineName.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MonitoringEvent other = (MonitoringEvent) obj;
        if (machineName == null) {
            if (other.machineName != null)
                return false;
        } else if (!machineName.equals(other.machineName))
            return false;
        return true;
    }

    public MonitoringEvent(String machineName) {
        super();
        this.machineName = machineName;
    }
}
```

Then we write the actual temperature event:

```java
package com.demo.chapter05;

public class TemperatureEvent extends MonitoringEvent {

    public TemperatureEvent(String machineName) {
        super(machineName);
    }

    private double temperature;

    public double getTemperature() {
        return temperature;
    }

    public void setTemperature(double temperature) {
        this.temperature = temperature;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = super.hashCode();
        long temp;
        temp = Double.doubleToLongBits(temperature);
        result = prime * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!super.equals(obj))
            return false;
        if (getClass() != obj.getClass())
            return false;
        TemperatureEvent other = (TemperatureEvent) obj;
        if (Double.doubleToLongBits(temperature) != Double.doubleToLongBits(other.temperature))
            return false;
        return true;
    }

    public TemperatureEvent(String machineName, double temperature) {
        super(machineName);
        this.temperature = temperature;
    }

    @Override
    public String toString() {
        return "TemperatureEvent [getTemperature()=" + getTemperature() + ", getMachineName()="
                + getMachineName() + "]";
    }
}
```

Now we can define the event source as follows.
In Java:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<TemperatureEvent> inputEventStream = env.fromElements(
    new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1),
    new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2),
    new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3),
    new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4),
    new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0));
```

In Scala:

```scala
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val input: DataStream[TemperatureEvent] = env.fromElements(
  new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1),
  new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2),
  new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3),
  new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4),
  new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0))
```

Pattern API

The Pattern API allows you to define complex event patterns very easily. Each pattern consists of multiple states. To go from one state to another, we generally need to define conditions. The conditions could be continuity or filtered-out events. Let's try to understand each pattern operation in detail.

Begin

The initial state can be defined as follows.

In Java:

```java
Pattern<Event, ?> start = Pattern.<Event>begin("start");
```

In Scala:

```scala
val start : Pattern[Event, _] = Pattern.begin("start")
```

Filter

We can also specify the filter condition for the initial state.

In Java:

```java
start.where(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // condition
    }
});
```

In Scala:

```scala
start.where(event => ... /* condition */)
```

Subtype

We can also filter out events based on their subtypes, using the subtype() method.

In Java:

```java
start.subtype(SubEvent.class).where(new FilterFunction<SubEvent>() {
    @Override
    public boolean filter(SubEvent value) {
        return ... // condition
    }
});
```

In Scala:

```scala
start.subtype(classOf[SubEvent]).where(subEvent => ... /* condition */)
```

Or

The Pattern API also allows us to define multiple conditions together. We can use the OR and AND operators.

In Java:

```java
pattern.where(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // condition
    }
}).or(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // or condition
    }
});
```

In Scala:

```scala
pattern.where(event => ... /* condition */).or(event => ... /* or condition */)
```

Continuity

As stated earlier, we do not always need to filter out events. There can be patterns where we need continuity instead of filters. Continuity can be of two types: strict continuity and non-strict continuity.

Strict continuity

Strict continuity requires two events to succeed each other directly, which means there should be no other event in between. This pattern is defined by next().

In Java:

```java
Pattern<Event, ?> strictNext = start.next("middle");
```

In Scala:

```scala
val strictNext: Pattern[Event, _] = start.next("middle")
```

Non-strict continuity

Non-strict continuity means that other events are allowed to occur between the two specified events. This pattern is defined by followedBy().

In Java:

```java
Pattern<Event, ?> nonStrictNext = start.followedBy("middle");
```

In Scala:

```scala
val nonStrictNext : Pattern[Event, _] = start.followedBy("middle")
```

Within

The Pattern API also allows us to do pattern matching based on time intervals. We can define a time-based temporal constraint as follows.
In Java:

```java
next.within(Time.seconds(30));
```

In Scala:

```scala
next.within(Time.seconds(10))
```

Detecting patterns

To detect patterns against the stream of events, we need to run the stream through the pattern. CEP.pattern() returns a PatternStream. The following code snippet shows how we can detect a pattern. First, the pattern is defined to check whether the temperature value reaches 26.0 degrees or more within 10 seconds.

In Java:

```java
Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first")
    .subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() {
        public boolean filter(TemperatureEvent value) {
            if (value.getTemperature() >= 26.0) {
                return true;
            }
            return false;
        }
    }).within(Time.seconds(10));

PatternStream<TemperatureEvent> patternStream = CEP.pattern(inputEventStream, warningPattern);
```

In Scala:

```scala
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val input = // data

val pattern: Pattern[TempEvent, _] = Pattern.begin("start").where(event => event.temp >= 26.0)

val patternStream: PatternStream[TempEvent] = CEP.pattern(input, pattern)
```

Use case – complex event processing on a temperature sensor

In the earlier sections, we learnt about the various features provided by the Flink CEP engine. Now it's time to understand how we can use it in a real-world solution. For that, let's assume we work for a mechanical company which produces some products. In the product factory, there is a need to constantly monitor certain machines. The factory has already set up sensors which keep sending the temperature of the machines at a given time. Now we will be setting up a system that constantly monitors the temperature value and generates an alert if the temperature exceeds a certain value. We can use the following architecture:

Here we will be using Kafka to collect events from the sensors. In order to write a Java application, we first need to create a Maven project and add the following dependencies:

```xml
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.9_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
```

Next, we need to do the following things to use Kafka. First, we need to define a custom Kafka deserializer. This will read bytes from a Kafka topic and convert them into a TemperatureEvent. The following is the code to do this.
EventDeserializationSchema.java:

```java
package com.demo.chapter05;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

public class EventDeserializationSchema implements DeserializationSchema<TemperatureEvent> {

    public TypeInformation<TemperatureEvent> getProducedType() {
        return TypeExtractor.getForClass(TemperatureEvent.class);
    }

    public TemperatureEvent deserialize(byte[] arg0) throws IOException {
        String str = new String(arg0, StandardCharsets.UTF_8);
        String[] parts = str.split("=");
        return new TemperatureEvent(parts[0], Double.parseDouble(parts[1]));
    }

    public boolean isEndOfStream(TemperatureEvent arg0) {
        return false;
    }
}
```

Next, we create a topic in Kafka called temperature:

```
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic temperature
```

Now we move to the Java code, which will listen to these events in Flink streams:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");

DataStream<TemperatureEvent> inputEventStream = env.addSource(
    new FlinkKafkaConsumer09<TemperatureEvent>("temperature", new EventDeserializationSchema(), properties));
```

Next, we will define the pattern to check whether the temperature reaches 26.0 degrees Celsius or more within 10 seconds:

```java
Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first")
    .subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() {
        private static final long serialVersionUID = 1L;

        public boolean filter(TemperatureEvent value) {
            if (value.getTemperature() >= 26.0) {
                return true;
            }
            return false;
        }
    }).within(Time.seconds(10));
```

Next, we match this pattern with the stream of events and select the event. We also add the alert messages to a results stream, as shown here:

```java
DataStream<Alert> patternStream = CEP.pattern(inputEventStream, warningPattern)
    .select(new PatternSelectFunction<TemperatureEvent, Alert>() {
        private static final long serialVersionUID = 1L;

        public Alert select(Map<String, TemperatureEvent> event) throws Exception {
            return new Alert("Temperature Rise Detected:" + event.get("first").getTemperature()
                + " on machine name:" + event.get("first").getMachineName());
        }
    });
```

In order to see the alerts generated, we will print the results:

```java
patternStream.print();
```

And we execute the stream:

```java
env.execute("CEP on Temperature Sensor");
```
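The select function above returns Alert objects, but the Alert class itself is not shown in this excerpt. A minimal sketch, assuming a simple message-only POJO whose toString() matches the sample output shown below, could look like this:

```java
package com.demo.chapter05;

// Minimal sketch of the Alert POJO referenced by the pattern-select code;
// the original class is not included in this excerpt, so this is an assumption.
public class Alert {

    private String message;

    public Alert(String message) {
        this.message = message;
    }

    public String getMessage() {
        return message;
    }

    public void setMessage(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return "Alert [message=" + message + "]";
    }
}
```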
Now we are all set to execute the application. As and when we get messages in the Kafka topic, the CEP will keep executing. The actual execution will look like the following.

Example input:

```
xyz=21.0
xyz=30.0
LogShaft=29.3
Boiler=23.1
Boiler=24.2
Boiler=27.0
Boiler=29.0
```

Example output:

```
Connected to JobManager at Actor[akka://flink/user/jobmanager_1#1010488393]
10/09/2016 18:15:55 Job execution switched to status RUNNING.
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to DEPLOYING
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to SCHEDULED
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to RUNNING
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to RUNNING
1> Alert [message=Temperature Rise Detected:30.0 on machine name:xyz]
2> Alert [message=Temperature Rise Detected:29.3 on machine name:LogShaft]
3> Alert [message=Temperature Rise Detected:27.0 on machine name:Boiler]
4> Alert [message=Temperature Rise Detected:29.0 on machine name:Boiler]
```

We can also configure a mail client and use an external web hook to send e-mail or messenger notifications. The code for the application can be found on GitHub: https://github.com/deshpandetanmay/mastering-flink.

Summary

We learnt about complex event processing (CEP). We discussed the challenges involved and how we can use the Flink CEP library to solve CEP problems. We also learnt about the Pattern API and the various operators we can use to define patterns. In the final section, we tried to connect the dots and walk through one complete use case. With some changes, this setup can be used as-is in various other domains as well. We will also see how to use Flink's built-in machine learning library to solve complex problems.

Resources for Article:

Further resources on this subject:
- Getting Started with Apache Spark DataFrames [article]
- Getting Started with Apache Hadoop and Apache Spark [article]
- Integrating Scala, Groovy, and Flex Development with Apache Maven [article]

SAP BusinessObjects: Customizing the Dashboard

Packt
27 May 2011
4 min read
SAP BusinessObjects Dashboards 4.0 Cookbook: over 90 simple and incredibly effective recipes for transforming your business data into exciting dashboards with SAP BusinessObjects Dashboards 4.0 Xcelsius.

Introduction

In this article, we will go through certain techniques showing how you can use the different cosmetic features Dashboard Design provides in order to improve the look of your dashboard. Dashboard Design provides a powerful way to capture the audience compared with other dashboard tools. It allows developers to build dashboards with the important 'wow' factor that other tools lack. Take, for example, two dashboards that have exactly the same functionality and placement of charts, but where one looks much more attractive than the other. In general, people looking at the nicer-looking dashboard will be more interested and thus get more value out of the data that comes out of it. So, not only does Dashboard Design provide a powerful and flexible way of presenting data, it also provides the 'wow' factor to capture a user's interest.

Changing the look of a chart

This recipe will run through changing the look of a chart. In particular, it will go through each tab under the appearance icon of the chart properties. We will then make modifications and see the resulting changes.

Getting ready

Insert a chart object onto the canvas. Prepare some data and bind it to the chart.

How to do it...

1. Double-click/right-click on the chart object on the canvas/object properties window to go into Chart Properties.
2. In the Layout tab, uncheck Show Chart Background.
3. In the Series tab, click on the colored square box circled in the next screenshot to change the color of the bar to your desired color. Then change the width of each bar: click on the Marker Size area and change it to 35.
4. Click on the colored boxes circled in red in the Axes tab and choose dark blue to modify the horizontal and vertical axes separately. Uncheck Show Minor Gridlines at the bottom so that we remove all the horizontal lines between the major gridlines.
5. Next, go to the Text and Color tabs, where you can make changes to all the different text areas of the chart.

How it works...

As you can see, the default chart looks plain and the bars are skinny, so it's harder to visualize things. It is a good idea to remove the chart background if there is an underlying background so that the chart blends in better. In addition, the changes to the chart colors and text provide additional aesthetics that help improve the look of the chart.

Adding a background to your dashboard

This recipe shows the usefulness of backgrounds in the dashboard. It will show how backgrounds can provide additional depth to objects and help group certain areas together for better visualization.

Getting ready

Make sure you have all your objects, such as charts and selectors, ready on the canvas. Here's an example of the two charts before the makeover. Bind some data to the charts if you want to change the coloring of the series.

How to do it...

1. Choose Background4 from the Art and Backgrounds tab of the Components window.
2. Stretch the background so that it fills the size of the canvas.
3. Make sure that the background is ordered before the charts. To change the ordering of the background, go to the object browser, select the background object, and then press the "-" key until the background object is behind the chart.
4. Select Background1 from the Art and Backgrounds tab and put two of them under the charts, as shown in the following screenshot.
5. When the backgrounds are in the proper place, open the properties window for the backgrounds and set the background color to your desired color. In this example, we picked turquoise blue for each background.

How it works...

As you can see from the before and after pictures, backgrounds can make a huge difference in terms of aesthetics. The objects are much more pleasant to look at now, and there is certainly a lot more depth to the charts. The best way to choose the right backgrounds for your dashboard is to play around with the different background objects and their colors. If you are not very artistic, you can come up with a set of examples and demonstrate them to the business user to see which one they prefer the most.

The EU commission introduces guidelines for achieving a ‘Trustworthy AI’

Savia Lobo
09 Apr 2019
4 min read
On the third day of the Digital Day 2019 held in Brussels, the European Commission introduced a set of essential guidelines for building trustworthy AI, which will guide companies and governments in building ethical AI applications. By introducing these new guidelines, the commission is working towards a three-step approach:

- Setting out the key requirements for trustworthy AI
- Launching a large-scale pilot phase for feedback from stakeholders
- Working on international consensus building for human-centric AI

The EU's high-level expert group on AI, which consists of 52 independent experts representing academia, industry, and civil society, came up with seven requirements that, according to them, future AI systems should meet.

Seven guidelines for achieving an ethical AI

- Human agency and oversight: AI systems should enable equitable societies by supporting human agency and fundamental rights, and not decrease, limit or misguide human autonomy.
- Robustness and safety: A trustworthy AI requires algorithms to be secure, reliable and robust enough to deal with errors or inconsistencies during all life cycle phases of AI systems.
- Privacy and data governance: Citizens should have full control over their own data, while data concerning them will not be used to harm or discriminate against them.
- Transparency: The traceability of AI systems should be ensured.
- Diversity, non-discrimination, and fairness: AI systems should consider the whole range of human abilities, skills and requirements, and ensure accessibility.
- Societal and environmental well-being: AI systems should be used to enhance positive social change and enhance sustainability and ecological responsibility.
- Accountability: Mechanisms should be put in place to ensure responsibility and accountability for AI systems and their outcomes.

According to the EU's official press release, "Following the pilot phase, in early 2020, the AI expert group will review the assessment lists for the key requirements, building on the feedback received. Building on this review, the Commission will evaluate the outcome and propose any next steps."

The plans fall under the Commission's AI strategy of April 2018, which "aims at increasing public and private investments to at least €20 billion annually over the next decade, making more data available, fostering talent and ensuring trust", the press release states.

Andrus Ansip, Vice-President for the Digital Single Market, said, "The ethical dimension of AI is not a luxury feature or an add-on. It is only with trust that our society can fully benefit from technologies. Ethical AI is a win-win proposition that can become a competitive advantage for Europe: being a leader of human-centric AI that people can trust."

Mariya Gabriel, Commissioner for Digital Economy and Society, said, "We now have a solid foundation based on EU values and following an extensive and constructive engagement from many stakeholders including businesses, academia and civil society. We will now put these requirements to practice and at the same time foster an international discussion on human-centric AI."

Thomas Metzinger, a Professor of Theoretical Philosophy at the University of Mainz, who was also a member of the commission's expert group that worked on the guidelines, has put forward an article titled 'Ethics washing made in Europe'. Metzinger said he worked on the Ethics Guidelines for nine months. "The result is a compromise of which I am not proud, but which is nevertheless the best in the world on the subject. The United States and China have nothing comparable. How does it fit together?", he writes.

Eline Chivot, a senior policy analyst at the Center for Data Innovation think tank, told The Verge, "We are skeptical of the approach being taken, the idea that by creating a golden standard for ethical AI it will confirm the EU's place in global AI development. To be a leader in ethical AI you first have to lead in AI itself."

To know more about this news in detail, read the EU press release.

- Is Google trying to ethics-wash its decisions with its new Advanced Tech External Advisory Council?
- IEEE Standards Association releases ethics guidelines for automation and intelligent systems
- Sir Tim Berners-Lee on digital ethics and socio-technical systems at ICDPPC 2018

Setting Up the iReport Pages

Packt
29 Mar 2010
2 min read
Configuring the page format

We can follow these steps to set up report pages:

1. Open the report List of Products.
2. Go to the menu Window | Report Inspector. The Report Inspector window will appear on the left side of the report designer.
3. Select the report List of Products, right-click on it, and choose Page Format….
4. The Page format… dialog box will appear. Select A4 from the Format drop-down list, and select Portrait from the Page orientation section.
5. You can modify the page margins if you need to, or leave them as they are to keep the default margins. For our report, you need not change the margins.
6. Press OK.

Page size

You have seen that there are many preset sizes/formats for the report, such as Custom, Letter, Note, Legal, A0 to A10, B0 to B5, and so on. Choose the appropriate one based on your requirements; we have chosen A4. If the number of columns is too high to fit in Portrait, choose the Landscape orientation. If you change the preset size, the report elements (title, column headings, fields, and other elements) will not be repositioned automatically according to the new page size. You have to position each element manually, so be careful if you decide to change the page size.

Configuring properties

We can modify the default settings of the report properties in the following way: right-click on List of Products and choose Properties. We can configure many important report properties from the Properties window. You can see that there are many options here: you can change the Report name, Page size, Margins, Columns, and more. We have already learnt about setting up pages, so now our concern is to learn about some of the other (More…) options.
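iReport writes these page settings into the report's JRXML file. If you ever need to apply the same settings programmatically, the JasperReports API exposes them on the JasperDesign class. The sketch below is illustrative only; the exact setter names (in particular setOrientation with OrientationEnum) assume a reasonably recent JasperReports release and are not taken from this article:

```java
// Illustrative sketch: setting the same A4/portrait page options through
// the JasperReports API instead of the iReport Page format... dialog.
import net.sf.jasperreports.engine.design.JasperDesign;
import net.sf.jasperreports.engine.type.OrientationEnum;

public class PageFormatExample {
    public static void main(String[] args) {
        JasperDesign design = new JasperDesign();
        design.setName("ListOfProducts");

        // A4 in points (1 point = 1/72 inch): 595 x 842
        design.setPageWidth(595);
        design.setPageHeight(842);
        design.setOrientation(OrientationEnum.PORTRAIT);

        // Default-style margins; adjust only if the layout requires it
        design.setLeftMargin(20);
        design.setRightMargin(20);
        design.setTopMargin(20);
        design.setBottomMargin(20);
    }
}
```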

Introduction to Couchbase

Packt
04 Nov 2015
20 min read
In this article by Henry Potsangbam, the author of the book Learning Couchbase, we will learn that Couchbase is a NoSQL, non-relational database management system, which differs from traditional relational database management systems in many significant ways. It is designed for distributed data stores with very large-scale data storage requirements (terabytes and petabytes of data). These types of data stores might not require fixed schemas, avoid join operations, and typically scale horizontally. The main feature of Couchbase is that it is schemaless: there is no fixed schema to store data, and there are no joins between data records or documents. It allows distributed storage and utilizes computing resources, such as CPU and RAM, across the nodes that are part of the Couchbase cluster. Couchbase databases provide the following benefits:

- A flexible data model. You don't need to worry about the schema; you can design your schema depending on the needs of your application domain, not on storage demands.
- Easy scalability. Since it's a distributed system, it can scale out horizontally without too many changes in the application. You can scale out with a few mouse clicks and rebalance very easily.
- High availability, since there are multiple servers and data is replicated across nodes.

The architecture of Couchbase

A Couchbase cluster consists of multiple nodes. A cluster is a collection of one or more instances of the Couchbase server that are configured as a logical cluster. The following diagram shows the Couchbase server architecture.

As mentioned earlier, while most cluster technologies work on a master-slave relationship, Couchbase works on a peer-to-peer node mechanism. This means there is no difference between the nodes in the cluster; the functionality provided by each node is the same. Thus, there is no single point of failure. When one node fails, another node takes up its responsibility, thus providing high availability.

The data manager

Any operation performed on the Couchbase database system gets stored in memory, which acts as a caching layer. By default, every document gets stored in memory for each read, insert, update, and so on, until the memory is full. It's a drop-in replacement for Memcache. However, in order to provide persistence of records, there is a concept called the disk queue, which flushes records to disk asynchronously without impacting the client request. This functionality is provided automatically by the data manager, without any human intervention.

Cluster management

The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and the data manager. It manages data storage and retrieval, and contains the memory cache layer, the disk persistence mechanism, and the query engine. Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicate with the data manager on that node to perform database operations.

Buckets

In an RDBMS, we usually encapsulate all of the relevant data for a particular application in a database. Say, for example, we are developing an e-commerce application.
We usually create a database named e-commerce that serves as the logical namespace for storing records in tables, such as customer or shopping cart details. In Couchbase terminology, this is called a bucket. So, whenever you want to store any document in a Couchbase cluster, you will create a bucket as a logical namespace as the first step. Precisely, a bucket is an independent virtual container that groups documents logically in a Couchbase cluster; it is the equivalent of a database namespace in an RDBMS. It can be accessed by various clients in an application, and you can also configure features such as security, replication, and so on per bucket.

In RDBMS development, we usually create one database and consolidate all related tables in that namespace. Likewise, in Couchbase, you will usually create one bucket per application and encapsulate all the documents in it. Now, let me explain this concept in detail, since it's the component that administrators and developers will be working with most of the time. In fact, I used to wonder why it is named "bucket"; perhaps because we can store anything in it, as we do in the physical world. In any database system, the main purpose is to store data, and the logical namespace for storing data is called a database. Likewise, in Couchbase, the namespace for storing data is called a bucket. In brief, it's a data container that stores data related to applications, either in RAM or on disk.

A bucket also helps you partition application data depending on an application's requirements. If you are hosting different types of applications in a cluster, say an e-commerce application and a data warehouse, you can partition them using buckets: create two buckets, one for the e-commerce application and another for the data warehouse. As a rule of thumb, you create one bucket for each application.

In an RDBMS, we store data in the form of rows in a table, which in turn is encapsulated by a database. In Couchbase, a bucket is the equivalent of a database, but there is no concept of tables; all data or records, referred to as documents, are stored directly in a bucket. Basically, the lowest namespace for storing documents or data in Couchbase is a bucket. Internally, Couchbase stores documents in different storage locations for different buckets. Information such as runtime statistics is collected and reported by the Couchbase cluster, grouped by bucket type. Couchbase also lets you flush individual buckets: you can create a separate temporary bucket, rather than a regular transaction bucket, when you need temporary storage for ad hoc requirements such as reporting or a temporary workspace for application programming, and flush the temporary bucket after use. The features or capabilities of a bucket depend on its type, which is discussed next.

Types of buckets

Couchbase provides two types of buckets, which are differentiated by their storage mechanism and capabilities:

- Memcached
- Couchbase

Memcached

As the name suggests, buckets of the Memcached type store documents only in RAM. This means that documents stored in a Memcached bucket are volatile in nature; such buckets won't survive a system reboot. Documents stored in such buckets are accessible by direct address using the key-value pair mechanism. The bucket is distributed, which means that it is spread across the Couchbase cluster nodes.
Since it's volatile in nature, you need to be sure of your use cases before using this type of bucket. You can use it to store data that is required only temporarily and for better performance, since all of the data is stored in memory, but that doesn't require durability. Suppose you need to display a list of countries in your application: instead of always fetching from disk storage, the better way is to fetch the data from disk once, populate it in the Memcached bucket, and use it in your application. In a Memcached bucket, the maximum allowed size of a document is 1 MB. All of the data is stored in RAM, and if the bucket runs out of memory, the oldest data is discarded. We can't replicate a Memcached bucket. It's completely compatible with the open source Memcached distributed memory object caching system. If you want to know more about the Memcached technology, you can refer to http://memcached.org/.

Couchbase

The Couchbase bucket type gives persistence to documents. It is distributed across a cluster of nodes, and you can configure replication for it, which is not supported by the Memcached bucket type. It's highly available, since documents are replicated across nodes in a cluster. You can verify the bucket using the web Admin UI as follows:

Understanding documents

By now, you must have understood the concept of buckets, how they work, how they are configured, and so on. Let's now understand the items that get stored in them. So, what is a document? A document is a piece of information or data that gets stored in a bucket. It's the smallest item that can be stored in a bucket. As a developer, you will always be working on a bucket in terms of documents. Documents are similar to rows in an RDBMS table schema, but in NoSQL terminology they are referred to as documents. It's a way of thinking about and designing data objects: all information and data gets stored as a document, just as it would be represented in a physical document. NoSQL databases, including Couchbase, don't require a fixed schema to store documents in a particular bucket. In Couchbase, documents are represented in the form of JSON.

For the time being, let's try to understand the document at a basic level. Let me show you how a document is represented in Couchbase for better clarity. You need to install the beer-sample bucket for this, which comes along with the Couchbase software installation. If you did not install it earlier, you can do it from the web console using the Settings button.

The document overview

The preceding screenshot shows a document; it represents a brewery, and its document ID is 21st_amendment_brewery_cafe. Each document can have multiple properties/items along with their values. For example, name is a property and 21st Amendment Brewery Café is the value of the name property. So, what is this document ID? The document ID is a unique identifier assigned to each document in a bucket. You need to assign a unique ID whenever a document gets stored in a bucket; it's just like the primary key of a table in an RDBMS.

Keys and metadata

As described earlier, a document key is a unique identifier for a document, and its value can be any string. In addition to the key, documents usually have three more types of metadata, which are provided by the Couchbase server unless modified by an application developer. They are as follows:

- rev: This is an internal revision ID meant for internal use by Couchbase. It should not be used in the application.
- expiration: If you want your document to expire after a certain amount of time, you can set that value here. By default, it is 0, that is, the document never expires.
- flags: These are numerical values specific to the client library that are updated when the document is stored.

Document modeling

Being schemaless is a good feature for bringing agility to applications whose business processes change frequently, as demanded by their business environment. With this methodology, you don't need to be concerned about the structure of the data initially while designing the application. This means that, as a developer, you don't need to worry about the structure of a database schema, such as tables, or about splitting information into various tables; instead, you should focus on the application requirements and on satisfying business needs.

I still recollect various moments of designing domain objects and tables that I've been through as a developer, especially when I had just graduated from engineering college and was developing applications for a corporate company. Whenever I was part of a discussion about application requirements, I had some of these questions at the back of my mind:

- How does a domain object get stored in the database?
- What will the table structures be?
- How will I retrieve the domain objects?
- Will it be difficult to use an ORM such as Hibernate, EJB, and so on?

My point here is that, instead of being mentally present in the discussion on requirement gathering and understanding the business requirements in detail, I spent more time mapping business entities into a table format. The reason was that if I did not put forward the technical constraints at that time, it would be difficult to come back later with the technical challenges we could face in the data structure design. Earlier, whenever we talked about application design, we always thought about database design structures, such as converting objects into multiple tables using normalization forms (2NF/3NF), and spent a lot of time mapping database objects to application objects using various ORM tools, such as Hibernate, EJB, and so on.

In document modeling, we always think in terms of application requirements, that is, the data or information flow, while designing documents, not in terms of storage. We can simply start our application development using the business representation of an entity without much concern about the storage structures. Having covered the various advantages provided by a document-based system, we will now discuss how to design such documents for storage in any document-based database system, such as Couchbase. Then, we can effectively design domain objects that are coherent with the application requirements.

Whenever we model a document's structure, we need to consider two main options: one is to store all information in one document, and the other is to break it down into multiple documents. You need to consider both and choose one, keeping the application requirements in mind. An important factor is to evaluate whether the information contains unrelated data components that are independent and can be broken up into different documents, or whether all components represent a complete domain object that will be accessed together most of the time. If the data components of a piece of information are related and will usually be required together in business logic, consider grouping them in a single logical container so that the application developer won't perceive them as separate objects or documents.
All of these factors depend on the nature of the application being developed and its use cases. Besides these, you need to think in terms of accessing information, such as atomicity, single unit of access, and so on. You can ask yourself a question such as, "Are we going to create or modify the information as a single unit or not?". We also need to consider concurrency, what will happen when the document is accessed by multiple clients at the same time and so on. After looking at all these considerations that you need to keep in mind while designing a document, you have two options: one is to keep all of the information in a single document, and the other is to have a separate document for every different object type. Couchbase SDK overview We have also discussed some of the guidelines used for designing document-based database system. What if we need to connect and perform operations on the Couchbase cluster in an application? This can be achieved using Couchbase client libraries, which are also collectively known as the Couchbase Software Development Kit (SDK). The Couchbase SDK APIs are language dependent. However, the concept remains the same and is applicable to all languages that are supported by the SDK. Let's now try to understand the Couchbase APIs as a concept without referring to any specific language, and then we will map these concepts to Java APIs in the Java SDK section. Couchbase SDK clients are also known as smart clients since they understand the overall status of the cluster, that is, clustermap, and keep the information of the vBucket and its server nodes updated. There are two types of Couchbase clients, as follows: Smart clients: Such clients can understand the health of the cluster and receive constant updates about the information of the cluster. Each smart client maintains a clustermap that can derive the cluster node where a document is stored using the document ID, for example, Java, .NET, and so on. Memcached-compatible: Such clients are used for applications that would be interacting with the traditional memcached bucket, which is not aware of vBucket. It needs to install Moxi (a memcached proxy) on all clients that require access to the Couchbase memcache bucket, which act as a proxy to convert the API's call to the memcache compatible call. Understanding the write operation in the Couchbase cluster Let's understand how the write operation works in the Couchbase cluster. When a write command is issued using the set operation on the Couchbase cluster, the server immediately responds once the document is written to the memory of that particular node. How do clients know which nodes in the cluster will be responsible for storing the document? You might recall that every operation requires a document ID, using this document ID, the hash algorithm determines the vBucket in which it belongs. Then, this vBucket is used to determine the node that will store the document. All mapping information, vBucket to node, is stored in each of the Couchbase client SDKs, which form the clustermap. Views Whenever we want to extract fields from JSON documents without document ID, we use views. If you want to find a document or fetch information about a document with attributes or fields of a document other than the document ID, a view is the way to go for it. Views are written in the form of MapReduce, which we have discussed earlier, that is, it consists of map and reduce phase. Couchbase implements MapReduce using the JavaScript language. 
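To give a first taste of what that JavaScript looks like before walking through the view engine, here is a minimal sketch of a map function over the beer-sample brewery documents. The chapter's own view appears only as a screenshot in the original, so this stand-in is an illustrative assumption rather than the author's exact code:

function (doc, meta) {
  // Index only brewery documents that actually carry a name
  if (doc.type == "brewery" && doc.name) {
    // Emit the name as the key and null as the value; the full document
    // can always be fetched later by its ID, available here as meta.id
    emit(doc.name, null);
  }
}

The sections that follow describe how such a function is processed by the view engine and what its two parameters mean.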
The following diagram shows you how various documents are passed through the View engine to produce an index. The View engine ensures that all documents in the bucket are passed through the map method for processing and subsequently to reduce function to create indexes.   When we write views, the View Engine defines materialized views for JSON documents and then queries across the dataset in the bucket. Couchbase provides a view processor to process the entire documents with map and reduce methods defined by the developer to create views. The views are maintained locally by each node for the documents stored in that particular node. Views are created for documents that are stored on the disk only. A view's life cycle A view has its own life cycle. You need to define, build, and query it, as shown in this diagram:   View life cycle  Initially, you will define the logic of MapReduce and build it on each node for each document that is stored locally. In the build phase, we usually emit those attributes that need to be part of indexes. Views usually work on JSON documents only. If documents are not in the JSON format or the attributes that we emit in the map function are not part of the document, then the document is ignored during the generation of views by the view engine. Finally, views are queried by clients to retrieve and find documents. After the completion of this cycle, you can still change the definition of MapReduce. For that, you need to bring the view to development mode and modify it. Thus, you have the view cycle as shown in the preceding diagram while developing a view.   The preceding code shows a view. A view has predefined syntax. You can't change the method signature. Here, it follows the functional programming syntax. The preceding code shows a map method that accepts two parameters: doc: This represents the entire document meta: This represents the metadata of the document Each map will return some objects in the form of key and value. This is represented by the emit() method. The emit() method returns key and value. However, value will usually be null. Since, we can retrieve a document using the key, it's better to use that instead of using the value field of the emit() method. Custom reduce functions Why do we need custom reduce functions? Sometimes, the built-in reduce function doesn’t meet our requirements, although it will suffice most of the time. Custom reduce functions allow you to create your own reduce function. In such a reduce function, the output of map function goes to the corresponding reduce function group as per the key of the map output and the group level parameter. Couchbase ensures that output from the map will be grouped by key and supplied to reduce. Then it’s the developer's role to define logic in reduce, what to perform on the data such as aggregating, addition, and so on. To handle the incremental MapReduce functionality (that is, updating an existing view), each function must also be able to handle and consume its own output. In an incremental situation, the function must handle both new records and previously computed reductions. The input to the reduce function can be not only raw data from the map phase, but also the output of a previous reduce phase. This is called re-reduce and can be identified by the third argument of reduce(). When the re-reduce argument is false, both the key and value arguments are arrays, the value argument array matches the corresponding element with that of array of key. 
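As a hedged sketch of the signature just described (again, not the book's screenshot code), a custom reduce that counts documents per key, including the re-reduce case, could be written as:

function (keys, values, rereduce) {
  if (rereduce) {
    // Re-reduce: values holds partial counts produced by earlier reduce
    // passes, so the function consumes its own output by summing them
    var total = 0;
    for (var i = 0; i < values.length; i++) {
      total = total + values[i];
    }
    return total;
  }
  // First pass: keys and values are parallel arrays of raw map output,
  // so the number of entries is the count for this group
  return values.length;
}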
For example, key[1] is the key that corresponds to value[1]. The execution flow from map to reduce is illustrated in the original by the figure "Map reduce execution in a view", which is not reproduced here.

N1QL overview

So far, you have learned how to fetch documents in two ways: using the document ID and using views. The third way of retrieving documents is N1QL, pronounced "nickel". Personally, I feel that providing an SQL-like syntax was a great move by Couchbase, since most engineers and IT professionals are already familiar with SQL, which is usually part of their formal education. It gives them confidence and makes Couchbase easier to adopt in their applications. Moreover, it covers most development-related database operations. N1QL can be used to:

Store documents, that is, the INSERT command
Fetch documents, that is, the SELECT command

Prior to the advent of N1QL, developers had to perform key-based operations, which became quite complex when information had to be retrieved using views and custom reduce functions. With the previously available options, developers needed to know the key before performing any operation on a document, which is not always the case. Before N1QL was incorporated into Couchbase, you could not run ad hoc queries on documents in a bucket until you had created views on it. Moreover, we sometimes need to perform joins or searches within a bucket, which is not possible using the document ID and views alone. All of these drawbacks are addressed by N1QL; I would go so far as to call N1QL an evolutionary step in Couchbase's history.

Understanding the N1QL syntax

Most N1QL queries have the following format:

SELECT [DISTINCT] <expression> FROM <data source> WHERE <expression> GROUP BY <expression> ORDER BY <expression> LIMIT <number> OFFSET <number>

The preceding statement is very generic; it shows the main options N1QL provides in a single piece of syntax:

SELECT * FROM LearningCouchbase

This selects every document stored in the bucket LearningCouchbase. The output of the query (not shown here) is in the JSON document format; all documents returned by an N1QL query appear as array values under the resultset attribute.

Summary

You learned how to design a document-based data schema and how to connect to Couchbase from a Java-based application using connection pooling. You also learned how to retrieve data using MapReduce-based views, how to use N1QL's SQL-like syntax to extract documents from a Couchbase bucket, and how to provide high availability with XDCR. It will also enable you to perform a full-text search by integrating the Elasticsearch plugin. Resources for Article: Further resources on this subject: MAPREDUCE FUNCTIONS [article] PUTTING YOUR DATABASE AT THE HEART OF AZURE SOLUTIONS [article] MOVING SPATIAL DATA FROM ONE FORMAT TO ANOTHER [article]

SQL Server Integration Services (SSIS)

Packt
03 Sep 2013
5 min read
(For more resources related to this topic, see here.) SSIS as an ETL – extract, transform, and load tool The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while moving the data between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can even be used to move data between external non-SQL systems without requiring SQL server to be the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file. SSIS also provides a workflow engine for automation of the different tasks (for example, data flows, tasks executions, and so on.) that are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL Agent can run multiple jobs independent of each other. An SSIS solution consists of one or more package, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, automation of command line commands, and others. In particular, a control flow usually includes one or more data flow tasks, which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has one control flow, and as many data flows as necessary. Data flow execution is dictated by the content of the control flow. A detailed discussion on SSIS and its components are outside the scope of this article and it assumes that you are familiar with the basic SSIS package development using Business Intelligence Development Studio (SQL Server 2005/2008/2008 R2) or SQL Server Data Tools (SQL Server 2012). If you are a beginner in SSIS, it is highly recommended to read from a bunch of good SSIS books available as a prerequisite. In the rest of this article, we will focus on how to consume Hive data from SSIS using the Hive ODBC driver. The prerequisites to develop the package shown in this article are SQL Server Data Tools, (which comes as a part of SQL Server 2012 Client Tools and Components) and the 32-bit Hive ODBC Driver installed. You will also need your Hadoop cluster up with Hive running on it. Developing the package SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, deploy, and develop SSIS packages. SSDT is installed when you choose to install SQL Server Client tools and Workstation Components from your SQL Server installation media. SSDT supports creation of Integration Services, Analysis Services, and Reporting Services projects. Here, we will focus on Integration Services project type. Creating the project Launch SQL Server Data Tools from SQL Server 2012 Program folders as shown in the following screenshot: Create a new Project and choose Integration Services Project in the New Project dialog as shown in the following screenshot: This should create the SSIS project with a blank Package.dtsx inside it visible in the Solution Explorer window of the project as shown in the following screenshot: Creating the Data Flow A Data Flow is a SSIS package component, which consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. 
Before you can add a data flow to a package, the package control flow must include a Data Flow task. The Data Flow task is the executable within the SSIS package, which creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. To create a Data Flow task, perform the following steps: Double-click (or drag-and-drop) on a Data Flow Task from the toolbox in the left. This should place a Data Flow Task in the Control Flow canvas of the package as in the following screenshot: Double-click on the Data Flow Task or click on the Data Flow tab in SSDT to edit the task and design the source and destination components as in the following screenshot: Creating the source Hive connection The first thing we need to do is create a connection manager that will connect to our Hive data tables hosted in the Hadoop cluster. We will use an ADO.NET connection, which will use the DSN HadoopOnLinux we created earlier to connect to Hive. To create the connection, perform the following steps: Right-click on the Connection Managers section in the project and click on New ADO.Net Connection... as shown in the following screenshot: From the list of providers, navigate to .Net Providers | ODBC Data Provider and click on OK in the Connection Manager window as shown in the following screenshot: Select the HadoopOnLinux DSN from the Data Sources list. Provide the Hadoop cluster credentials and test connection should succeed as shown in the following screenshot: Summary In this way we learned how to create an SQL Server Integration Services package to move data from Hadoop to SQL Server using the Hive ODBC driver. Resources for Article: Further resources on this subject: Microsoft SQL Azure Tools [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Getting Started with SQL Developer: Part 1 [Article]


How to Mine Popular Trends on GitHub using Python - Part 2

Amey Varangaonkar
27 Dec 2017
1 min read
Note: This article is an excerpt taken from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. In this book, you will find widely used social media mining techniques for extracting useful insights to drive your business.

In Part 1 of this series, we gathered the GitHub data for analysis. Here, we will analyze that data as per our requirements, to get interesting insights on the highest trending and most popular tools and languages on GitHub. We have seen so far that the GitHub API provides interesting sets of information about code repositories and metadata around the activity of its users on these repositories. In the following sections, we will analyze this data to find the most popular repositories through the analysis of their descriptions, and then drill down into the watchers, forks, and issues submitted for the emerging technologies. Since technology is evolving so rapidly, this approach can help us stay on top of the latest trends. In order to find out what the trending technologies are, we will perform the analysis in a few steps:

Identifying top technologies

First of all, we will use text analytics techniques to identify the most popular technology-related phrases in repositories from 2017. Our analysis will be focused on the most frequent bigrams. We import the nltk.collocations module, which implements n-gram search tools:

import nltk
from nltk.collocations import *

Then, we convert the clean description column into a list of tokens:

list_documents = df['clean'].apply(lambda x: x.split()).tolist()

As we perform the analysis on documents, we will use the method from_documents instead of the default from_words. The difference between these two methods lies in the input data format. The one used in our case takes a list of tokens as its argument and searches for n-grams document-wise instead of corpus-wise. It protects against detecting bigrams composed of the last word of one document and the first word of another:

bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_documents(list_documents)

We take into account only bigrams which appear at least three times in our document set:

bigram_finder.apply_freq_filter(3)

We can use different association measures to find the best bigrams, such as raw frequency, pmi, student t, or chi sq. We will mostly be interested in the raw frequency measure, which is the simplest and most convenient indicator in our case. We get the top 20 bigrams according to the raw_freq measure:

bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 20)

We can also obtain their scores by applying the score_ngrams method:

scores = bigram_finder.score_ngrams(bigram_measures.raw_freq)

All the other measures are implemented as methods of BigramCollocationFinder. To try them, you can replace raw_freq by, respectively, pmi, student_t, and chi_sq. However, to create a visualization we will need the actual number of occurrences instead of scores. We create a list by using the ngram_fd.items() method and sort it in descending order:

ngram = list(bigram_finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)

This returns a list of tuples, each containing an embedded bigram tuple and its frequency.
We transform it into a simple list of tuples where we join bigram tokens: frequency = [(" ".join(k), v) for k,v in ngram] For simplicity reasons we put the frequency list into a dataframe: df=pd.DataFrame(frequency) And then, we plot the top 20 technologies in a bar chart: import matplotlib.pyplot as plt plt.style.use('ggplot') df.set_index([0], inplace = True) df.sort_values(by = [1], ascending = False).head(20).plot(kind = 'barh') plt.title('Trending Technologies') plt.ylabel('Technology') plt.xlabel('Popularity') plt.legend().set_visible(False) plt.axvline(x=14, color='b', label='Average', linestyle='--', linewidth=3) for custom in [0, 10, 14]: plt.text(14.2, custom, "Neural Networks", fontsize = 12, va = 'center', bbox = dict(boxstyle='square', fc='white', ec='none')) plt.show() We've added an additional line which helps us to aggregate all technologies related to neural networks. It is done manually by selecting elements by indices, (0,10,14) in this case. This operation might be useful for interpretation. The preceding analysis provides us with an interesting set of the most popular technologies on GitHub. It includes topics for software engineering, programming languages, and artificial intelligence. An important thing to be noted is that technology around neural networks emerges more than once, notably, deep learning, TensorFlow, and other specific projects. This is not surprising, since neural networks, which are an important component in the field of artificial intelligence, have been spoken about and practiced heavily in the last few years. So, if you're an aspiring programmer interested in AI and machine learning, this is a field to dive into! Programming languages  The next step in our analysis is the comparison of popularity between different programming languages. It will be based on samples of the top 1,000 most popular repositories by year. Firstly, we get the data for last three years: queries = ["created:>2017-01-01", "created:2015-01-01..2015-12-31", "created:2016-01-01..2016-12-31"] We reuse the search_repo_paging function to collect the data from the GitHub API and we concatenate the results to a new dataframe. df = pd.DataFrame() for query in queries: data = search_repo_paging(query) data = pd.io.json.json_normalize(data) df = pd.concat([df, data]) We convert the dataframe to a time series based on the create_at column df['created_at'] = df['created_at'].apply(pd.to_datetime) df = df.set_index(['created_at']) Then, we use aggregation method groupby which restructures the data by language and year, and we count the number of occurrences by language: dx = pd.DataFrame(df.groupby(['language', df.index.year])['language'].count()) We represent the results on a bar chart: fig, ax = plt.subplots() dx.unstack().plot(kind='bar', title = 'Programming Languages per Year', ax= ax) ax.legend(['2015', '2016', '2017'], title = 'Year') plt.show() The preceding graph shows a multitude of programming languages from assembly, C, C#, Java, web, and mobile languages, to modern ones like Python, Ruby, and Scala. Comparing over the three years, we see some interesting trends. We notice HTML, which is the bedrock of all web development, has remained very stable over the last three years. This is not something that will not be replaced in a hurry. Once very popular, Ruby now has a decrease in popularity. The popularity of Python, also our language of choice for this book, is going up. 
Finally, the cross-device programming language Swift, initially created by Apple but now open source, is getting extremely popular over time. It will be interesting to see in the next few years whether these trends change or hold true.

Programming languages used in top technologies

Now we know the top programming languages and the technologies quoted in repository descriptions. In this section, we will combine this information and find out which are the main programming languages for each technology. We select four technologies from the previous section and print the corresponding programming languages. We look up the column containing the cleaned repository description and create a set of the languages related to each technology; using a set ensures that the values are unique.

technologies_list = ['software engineering', 'deep learning', 'open source', 'exercise practice']
for tech in technologies_list:
    print(tech)
    print(set(df[df['clean'].str.contains(tech)]['language']))

software engineering
{'HTML', 'Java'}
deep learning
{'Jupyter Notebook', None, 'Python'}
open source
{None, 'PHP', 'Java', 'TypeScript', 'Go', 'JavaScript', 'Ruby', 'C++'}
exercise practice
{'CSS', 'JavaScript', 'HTML'}

By analyzing the descriptions of the top technologies and then extracting the programming languages used for them, we get a quick picture of which language communities drive each topic. There is a lot more analysis you can do with this GitHub data. Want to know how? You can check out our book Python Social Media Analytics to get a detailed walkthrough of these topics.


Structural Equation Modeling and Confirmatory Factor Analysis

Packt
06 Feb 2015
30 min read
In this article by Paul Gerrard and Radia M. Johnson, the authors of Mastering Scientific Computation with R, we'll discuss the fundamental ideas underlying structural equation modeling, which are often overlooked in other books discussing structural equation modeling (SEM) in R, and then delve into how SEM is done in R. We will then discuss two R packages, OpenMx and lavaan. We can directly apply our discussion of the linear algebra underlying SEM using OpenMx. Because of this, we will go over OpenMx first. We will then discuss lavaan, which is probably more user friendly because it sweeps the matrices and linear algebra representations under the rug so that they are invisible unless the user really goes looking for them. Both packages continue to be developed and there will always be some features better supported in one of these packages than in the other. (For more resources related to this topic, see here.) SEM model fitting and estimation methods To ultimately find a good solution, software has to use trial and error to come up with an implied covariance matrix that matches the observed covariance matrix as well as possible. The question is what does "as well as possible" mean? The answer to this is that the software must try to minimize some particular criterion, usually some sort of discrepancy function. Just what that criterion is depends on the estimation method used. The most commonly used estimation methods in SEM include: Ordinary least squares (OLS) also called unweighted least squares Generalized least squares (GLS) Maximum likelihood (ML) There are a number of other estimation methods as well, some of which can be done in R, but here we will stick with describing the most common ones. In general, OLS is the simplest and computationally cheapest estimation method. GLS is computationally more demanding, and ML is computationally more intensive. We will see why this is, as we discuss the details of these estimation methods. Any SEM estimation method seeks to estimate model parameters that recreate the observed covariance matrix as well as possible. To evaluate how closely an implied covariance matrix matches an observed covariance matrix, we need a discrepancy function. If we assume multivariate normality of the observed variables, the following function can be used to assess discrepancy: In the preceding figure, R is the observed covariance matrix, C is the implied covariance matrix, and V is a weight matrix. The tr function refers to the trace function, which sums the elements of the main diagonal. The choice of V varies based on the SEM estimation method: For OLS, V = I For GLS, V = R-1 In the case of an ML estimation, we seek to minimize one of a number of similar criteria to describe ML, as follows: In the preceding figure, n is the number of variables. There are a couple of points worth noting here. GLS estimation inverts the observed correlation matrix, something computationally demanding with large matrices, but something that must only be done once. Alternatively, ML requires inversion of the implied covariance matrix, which changes with each iteration. Thus, each iteration requires the computationally demanding step of matrix inversion. With modern fast computers, this difference may not be noticeable, but with large SEM models, this might start to be quite time-consuming. Assessing SEM model fit The final question in an SEM model is how well the model explains the data. This is answered with the use of SEM measures of fit. 
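Before turning to those measures, it helps to write out the discrepancy functions described above, which appear only as figures in the original. In standard notation (and up to a constant scaling factor that differs between textbooks), the least squares family and the ML criterion are usually given as:

F_{LS} = \frac{1}{2}\,\mathrm{tr}\!\left[\left((R - C)\,V\right)^{2}\right], \qquad V = I \ \text{(OLS)}, \quad V = R^{-1} \ \text{(GLS)}

F_{ML} = \ln\lvert C\rvert - \ln\lvert R\rvert + \mathrm{tr}\!\left(R\,C^{-1}\right) - n

Here R is the observed covariance matrix, C is the implied covariance matrix, V is the weight matrix, and n is the number of observed variables, matching the definitions used above. These are the standard textbook forms rather than a reproduction of the original figures, so treat them as a reference sketch.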
Most of these measures are based on a chi-squared distribution. The fit criteria for GLS and ML (as well as a number of other estimation procedures such as asymptotic distribution-free methods) multiplied by N-1 is approximately chi-square distributed. Here, the capital N represents the number of observations in the dataset, as opposed to lower case n, which gives the number of variables. We compute degrees of freedom as the difference between the number of estimated parameters and the number of known covariances (that is, the total number of values in one triangle of an observed covariance matrix). This gives way to the first test statistic for SEM models, a chi-squared significance level comparing our chi-square value to some minimum chi-square threshold to achieve statistical significance. As with conventional chi-square testing, a chi-square value that is higher than some minimal threshold will reject the null hypothesis. Most experimental science features such as rejection supports the hypothesis of the experiment. This is not the case in SEM, where the null hypothesis is that the model fits the data. Thus, a non-significant chi-square is an indicator of model fit, whereas a significant chi-square rejects model fit. A notable limitation of this is that a greater sample size, greater N, will increase the chi-square value and will therefore increase the power to reject model fit. Thus, using conventional chi-squared testing will tend to support models developed in small samples and reject models developed in large samples. The choice an interpretation of fit measures is a contentious one in SEM literature. However, as can be seen, chi-square has limitations. As such, other model fit criteria were developed that do not penalize models that fit in large samples (some may penalize models fit to small samples though). There are over a dozen indices, but the most common fit indices and interpretation information are as follows: Comparative fit index: In this index, a higher value is better. Conventionally, a value of greater than 0.9 was considered an indicator of good model fit, but some might argue that a value of at least 0.95 is needed. This is relatively sample size insensitive. Root mean square error of approximation: A value of under 0.08 (smaller is better) is often considered necessary to achieve model fit. However, this fit measure is quite sample size sensitive, penalizing small sample studies. Tucker-Lewis index (Non-normed fit index): This is interpreted in a similar manner as the comparative fit index. Also, this is not very sample size sensitive. Standardized root mean square residual: In this index, a lower value is better. A value of 0.06 or less is considered needed for model fit. Also, this may penalize small samples. In the next section, we will show you how to actually fit SEM models in R and how to evaluate fit using fit measures. Using OpenMx and matrix specification of an SEM We went through the basic principles of SEM and discussed the basic computational approach by which this can be achieved. SEM remains an active area of research (with an entire journal devoted to it, Structural Equation Modeling), so there are many additional peculiarities, but rather than delving into all of them, we will start by delving into actually fitting an SEM model in R. 
OpenMx is not in the CRAN repository, but it is easily obtainable from the OpenMx website, by typing the following in R: source('http://openmx.psyc.virginia.edu/getOpenMx.R')" Summarizing the OpenMx approach In this example, we will use OpenMx by specifying matrices as mentioned earlier. To fit an OpenMx model, we need to first specify the model and then tell the software to attempt to fit the model. Model specification involves four components: Specifying the model matrices; this has two parts: Declare starting values for the estimation Declaring which values can be estimated and which are fixed Telling OpenMx the algebraic relationship of the matrices that should produce an implied covariance matrix Giving an instruction for the model fitting criterion Providing a source of data The R commands that correspond to each of these steps are: mxMatrix mxAlgebra mxMLObjective mxData We will then pass the objects created with each of these commands to create an SEM model using mxModel. Explaining an entire example First, to make things simple, we will store the FALSE and TRUE logical values in single letter variables, which will be convenient when we have matrices full of TRUE and FALSE values as follows: F <- FALSE T <- TRUE Specifying the model matrices Specifying matrices is done with the mxMatrix function, which returns an MxMatrix object. (Note that the object starts with a capital "M" while the function starts with a lowercase "m.") Specifying an MxMatrix is much like specifying a regular R matrix, but MxMatrices has some additional components. The most notable difference is that there are actually two different matrices used to create an MxMatrix. The first is a matrix of starting values, and the second is a matrix that tells which starting values are free to be estimated and which are not. If a starting value is not freely estimable, then it is a fixed constant. Since the actual starting values that we choose do not really matter too much in this case, we will just pick one as a starting value for all parameters that we would like to be estimated. 
Let's take a look at the following example: mx.A <- mxMatrix( type = "Full", nrow=14, ncol=14, #Provide the Starting Values values = c(    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0 ), #Tell R which values are free to be estimated    free = c(    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, T, F ), byrow=TRUE,   #Provide a matrix name that will be used in model fitting name="A", ) We will now apply this same technique to the S matrix. Here, we will create two S matrices, S1 and S2. They differ simply in the starting values that they supply. We will later try to fit an SEM model using one matrix, and then the other to address problems with the first one. The difference is that S1 uses starting variances of 1 in the diagonal, and S2 uses starting variances of 5. Here, we will use the "symm" matrix type, which is a symmetric matrix. We could use the "full" matrix type, but by using "symm", we are saved from typing all of the symmetric values in the upper half of the matrix. 
Let's take a look at the following matrix: mx.S1 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    1,    0, 1,    0, 0, 1,    0, 1, 0, 1,    1, 0, 0, 0, 1,    0, 1, 0, 0, 0, 1,    0, 0, 1, 0, 0, 0, 1,    0, 0, 0, 1, 0, 1, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ),      free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" )   #The alternative, S2 matrix: mx.S2 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    5,    0, 5,    0, 0, 5,    0, 1, 0, 5,    1, 0, 0, 0, 5,    0, 1, 0, 0, 0, 5,    0, 0, 1, 0, 0, 0, 5,    0, 0, 0, 1, 0, 1, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5 ),         free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" ) mx.Filter <- mxMatrix("Full", nrow=11, ncol=14, values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,      0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0    ),    free=FALSE,    name="Filter",    byrow = TRUE ) And finally, we will create our identity and filter matrices the same way, as follows: mx.I <- mxMatrix("Full", nrow=14, ncol=14,    values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1    ),    free=FALSE,    byrow = TRUE,    name="I" ) Fitting the model Now, it is time to declare the model that we would like to fit using the mxModel command. This part includes steps 2 through step 4 mentioned earlier. Here, we will tell mxModel which matrices to use. 
We will then use the mxAlgegra command to tell R how the matrices should be combined to reproduce the implied covariance matrix. We will tell R to use ML estimation with the mxMLObjective command, and we will tell it to apply the estimation to a particular matrix algebra, which we named "C". This is simply the right-hand side of the McArdle McDonald equation. Finally, we will tell R where to get the data to use in model fitting using the following code: factorModel.1 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S1, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),      mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) ) Now, let's tell R to fit the model and summarize the results using mxRun, as follows: summary(mxRun(factorModel.1)) Running Political Democracy Model Error in summary(mxRun(factorModel.1)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error: The job for model 'Political Democracy Model' exited abnormally with the error message: Expected covariance matrix is non-positive-definite. Uh oh! We got an error message telling us that the expected covariance matrix is not positive definite. Our observed covariance matrix is positive definite but the implied covariance matrix (at least at first) is not. This is an effect of the fact that if we multiply our starting value matrices together as specified by the McArdle McDonald equation, we get a starting implied covariance matrix. If we perform an eigenvalue decomposition of this starting implied covariance matrix, then we will find that the last eigenvalue is negative. This means a negative variance does not make much sense, and this is what "not positive definite" refers to. The good news is that this is simply our starting values, so we can fix this if we modify our starting values. In this case, we can choose values of five along the diagonal of the S matrix, and get a positive definite starting implied covariance matrix. We can rerun this using the mx.S2 matrix specified earlier and the software will proceed as follows: #Rerun with a positive definite matrix   factorModel.2 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S2, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),    mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) )   summary(mxRun(factorModel.2)) This should provide a solution. As can be seen from the previous code, the parameters solved in the model are returned as matrix components. Just like we had to figure out how to go from paths to matrices, we now have to figure out how to go from matrices to paths (the reverse problem). In the following screenshot, we show just the first few free parameters: The preceding screenshot tells us that the parameter estimated in the position of the tenth row and twelfth column in the matrix A is 2.18. This corresponds to a path from the twelfth variable in the A matrix ind60, to the 10th variable in the matrix x2. Thus, the path coefficient from ind60 to x2 is 2.18. There are a few other pieces of information here. The first one tells us that the model has not converged but is "Mx status Green." 
This means that the model was still converging when it stopped running (that is, it did not converge), but an optimal solution was still found and therefore, the results are likely reliable. Model fit information is also provided suggesting a pretty good model fit with CFI of 0.99 and RMSEA of 0.032. This was a fair amount of work, and creating model matrices by hand from path diagrams can be quite tedious. For this reason, SEM fitting programs have generally adopted the ability to fit SEM by declaring paths rather than model matrices. OpenMx has the ability to allow declaration by paths, but applying model matrices has a few advantages. Principally, we get under the hood of SEM fitting. If we step back, we can see that OpenMx actually did very little for us that is specific to SEM. We told OpenMx how we wanted matrices multiplied together and which parameters of the matrix were free to be estimated. Instead of using the RAM specification, we could have passed the matrices of the LISREL or Bentler-Weeks models with the corresponding algebra methods to recreate an implied covariance matrix. This means that if we are trying to come up with our matrix specification, reproduce prior research, or apply a new SEM matrix specification method published in the literature, OpenMx gives us the power to do it. Also, for educators wishing to teach the underlying mathematical ideas of SEM, OpenMx is a very powerful tool. Fitting SEM models using lavaan If we were to describe OpenMx as the SEM equivalent of having a well-stocked pantry and full kitchen to create whatever you want, and you have the time and know how to do it, we might regard lavaan as a large freezer full of prepackaged microwavable dinners. It does not allow quite as much flexibility as OpenMx because it sweeps much of the work that we did by hand in OpenMx under the rug. Lavaan does use an internal matrix representation, but the user never has to see it. It is this sweeping under the rug that makes lavaan generally much easier to use. It is worth adding that the list of prepackaged features that are built into lavaan with minimal additional programming challenge many commercial SEM packages. The lavaan syntax The key to describing lavaan models is the model syntax, as follows: X =~ Y: Y is a manifestation of the latent variable X Y ~ X: Y is regressed on X Y ~~ X: The covariance between Y and X can be estimated Y ~ 1: This estimates the intercept for Y (implicitly requires mean structure) Y | a*t1 + b*t2: Y has two thresholds that is a and b Y ~ a * X: Y is regressed on X with coefficient a Y ~ start(a) * X: Y is regressed on X; the starting value used for estimation is a It may not be evident at first, but this model description language actually makes lavaan quite powerful. Wherever you have seen a or b in the previous examples, a variable or constant can be used in their place. The beauty of this is that multiple parameters can be constrained to be equal simply by assigning a single parameter name to them. Using lavaan, we can fit a factor analysis model to our physical functioning dataset with only a few lines of code: phys.func.data <- read.csv('phys_func.csv')[-1] names(phys.func.data) <- LETTERS[1:20] R has a built-in vector named LETTERS, which contains all of the capital letters of the English alphabet. The lower case vector letters contains the lowercase alphabet. We will then describe our model using the lavaan syntax. Here, we have a model of three latent variables, our factors, and each of them has manifest variables. 
Let's take a look at the following example: model.definition.1 <- ' #Factors    Cognitive =~ A + Q + R + S    Legs =~ B + C + D + H + I + J + M + N    Arms =~ E + F+ G + K +L + O + P + T    #Correlations Between Factors    Cognitive ~~ Legs    Cognitive ~~ Arms    Legs ~~ Arms ' We then tell lavaan to fit the model as follows: fit.phys.func <- cfa(model.definition.1, data=phys.func.data, ordered= c('A','B', 'C','D', 'E','F','G', 'H','I','J', 'K', 'L','M','N','O','P','Q','R', 'S', 'T')) In the previous code, we add an ordered = argument, which tells lavaan that some variables are ordinal in nature. In response, lavaan estimates polychoric correlations for these variables. Polychoric correlations assume that we binned a continuous variable into discrete categories, and attempts to explicitly model correlations assuming that there is some continuous underlying variable. Part of this requires finding thresholds (placed on an arbitrary scale) between each categorical response. (for example, threshold 1 falls between the response of 1 and 2, and so on). By telling lavaan to treat some variables as categorical, lavaan will also know to use a special estimation method. Lavaan will use diagonally weighted least squares, which does not assume normality and uses the diagonals of the polychoric correlation matrix for weights in the discrepancy function. With five response options, it is questionable as to whether polychoric correlations are truly needed. Some analysts might argue that with many response options, the data can be treated as continuous, but here we use this method to show off lavaan's capabilities. All SEM models in lavaan use the lavaan command. Here, we use the cfa command, which is one of a number of wrapper functions for the lavaan command. Others include sem and growth. These commands differ in the default options passed to the lavaan command. (For full details, see the package documentation.) Summarizing the data, we can see the loadings of each item on the factor as well as the factor intercorrelations. We can also see the thresholds between each category from the polychoric correlations as follows: summary(fit.phys.func) We can also assess things such as model fit using the fitMeasures command, which has most of the popularly used fit measures and even a few obscure ones. Here, we tell lavaan to simply extract three measures of model fit as follows: fitMeasures(fit.phys.func, c('rmsea', 'cfi', 'srmr')) Collectively, these measures suggest adequate model fit. It is worth noting here that the interpretation of fit measures largely comes from studies using maximum likelihood estimation, and there is some debate as to how well these generalize other fitting methods. The lavaan package also has the capability to use other estimators that treat the data as truly continuous in nature. For this, a particular dataset is far from multivariate normal distributed, so an estimator such as ML is appropriate to use. However, if we wanted to do so, the syntax would be as follows: fit.phys.func.ML <- cfa(model.definition.1, data=phys.func.data, estimator = 'ML') Comparing OpenMx to lavaan It can be seen that lavaan has a much simpler syntax that allows to rapidly model basic SEM models. However, we were a bit unfair to OpenMx because we used a path model specification for lavaan and a matrix specification for OpenMx. The truth is that OpenMx is still probably a bit wordier than lavaan, but let's apply a path model specification in each to do a fair head-to-head comparison. 
We will use the famous Holzinger-Swineford 1939 dataset here from the lavaan package to do our modeling, as follows: hs.dat <- HolzingerSwineford1939 We will create a new dataset with a shorter name so that we don't have to keep typing HozlingerSwineford1939. Explaining an example in lavaan We will learn to fit the Holzinger-Swineford model in this section. We will start by specifying the SEM model using the lavaan model syntax: hs.model.lavaan <- ' visual =~ x1 + x2 + x3 textual =~ x4 + x5 + x6 speed   =~ x7 + x8 + x9   visual ~~ textual visual ~~ speed textual ~~ speed '   fit.hs.lavaan <- cfa(hs.model.lavaan, data=hs.dat, std.lv = TRUE) summary(fit.hs.lavaan) Here, we add the std.lv argument to the fit function, which fixes the variance of the latent variables to 1. We do this instead of constraining the first factor loading on each variable to 1. Only the model coefficients are included for ease of viewing in this book. The result is shown in the following model: > summary(fit.hs.lavaan) …                      Estimate Std.err Z-value P(>|z|) Latent variables: visual =~    x1               0.900   0.081   11.127   0.000    x2               0.498   0.077   6.429   0.000    x3              0.656   0.074   8.817   0.000 textual =~    x4               0.990   0.057   17.474   0.000    x5               1.102   0.063   17.576   0.000    x6               0.917   0.054   17.082   0.000 speed =~    x7               0.619   0.070   8.903   0.000    x8               0.731   0.066   11.090   0.000    x9               0.670   0.065   10.305   0.000   Covariances: visual ~~    textual           0.459   0.064   7.189   0.000    speed             0.471   0.073   6.461   0.000 textual ~~    speed             0.283   0.069   4.117   0.000 Let's compare these results with a model fit in OpenMx using the same dataset and SEM model. Explaining an example in OpenMx The OpenMx syntax for path specification is substantially longer and more explicit. Let's take a look at the following model: hs.model.open.mx <- mxModel("Holzinger Swineford", type="RAM",      manifestVars = names(hs.dat)[7:15], latentVars = c('visual', 'textual', 'speed'),    # Create paths from latent to observed variables mxPath(        from = 'visual',        to = c('x1', 'x2', 'x3'),    free = c(TRUE, TRUE, TRUE),    values = 1          ), mxPath(        from = 'textual',        to = c('x4', 'x5', 'x6'),        free = c(TRUE, TRUE, TRUE),        values = 1      ), mxPath(    from = 'speed',    to = c('x7', 'x8', 'x9'),    free = c(TRUE, TRUE, TRUE),    values = 1      ), # Create covariances among latent variables mxPath(    from = 'visual',    to = 'textual',    arrows=2,    free=TRUE      ), mxPath(        from = 'visual',        to = 'speed',        arrows=2,        free=TRUE      ), mxPath(        from = 'textual',        to = 'speed',        arrows=2,        free=TRUE      ), #Create residual variance terms for the latent variables mxPath(    from= c('visual', 'textual', 'speed'),    arrows=2, #Here we are fixing the latent variances to 1 #These two lines are like st.lv = TRUE in lavaan    free=c(FALSE,FALSE,FALSE),    values=1 ), #Create residual variance terms mxPath( from= c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'),    arrows=2, ),    mxData(        observed=cov(hs.dat[,c(7:15)]),        type="cov",        numObs=301    ) )     fit.hs.open.mx <- mxRun(hs.model.open.mx) summary(fit.hs.open.mx) Here are the results of the OpenMx model fit, which look very similar to lavaan's. This gives a long output. 
For ease of viewing, only the most relevant parts of the output are included in the following model (the last column that R prints giving the standard error of estimates is also not shown here): > summary(fit.hs.open.mx) …   free parameters:                            name matrix     row     col Estimate Std.Error 1   Holzinger Swineford.A[1,10]     A     x1 visual 0.9011177 2   Holzinger Swineford.A[2,10]     A     x2 visual 0.4987688 3   Holzinger Swineford.A[3,10]     A     x3 visual 0.6572487 4   Holzinger Swineford.A[4,11]     A     x4 textual 0.9913408 5   Holzinger Swineford.A[5,11]     A     x5 textual 1.1034381 6   Holzinger Swineford.A[6,11]     A     x6 textual 0.9181265 7   Holzinger Swineford.A[7,12]     A     x7   speed 0.6205055 8   Holzinger Swineford.A[8,12]     A     x8 speed 0.7321655 9   Holzinger Swineford.A[9,12]     A     x9   speed 0.6710954 10   Holzinger Swineford.S[1,1]     S     x1     x1 0.5508846 11   Holzinger Swineford.S[2,2]     S     x2     x2 1.1376195 12   Holzinger Swineford.S[3,3]     S    x3     x3 0.8471385 13   Holzinger Swineford.S[4,4]     S     x4     x4 0.3724102 14   Holzinger Swineford.S[5,5]     S     x5     x5 0.4477426 15   Holzinger Swineford.S[6,6]     S     x6     x6 0.3573899 16   Holzinger Swineford.S[7,7]      S     x7     x7 0.8020562 17   Holzinger Swineford.S[8,8]     S     x8     x8 0.4893230 18   Holzinger Swineford.S[9,9]     S     x9     x9 0.5680182 19 Holzinger Swineford.S[10,11]     S visual textual 0.4585093 20 Holzinger Swineford.S[10,12]     S visual   speed 0.4705348 21 Holzinger Swineford.S[11,12]     S textual   speed 0.2829848 In summary, the results agree quite closely. For example, looking at the coefficient for the path going from the latent variable visual to the observed variable x1, lavaan gives an estimate of 0.900 while OpenMx computes a value of 0.901. Summary The lavaan package is user friendly, pretty powerful, and constantly adding new features. Alternatively, OpenMx has a steeper learning curve but tremendous flexibility in what it can do. Thus, lavaan is a bit like a large freezer full of prepackaged microwavable dinners, whereas OpenMx is like a well-stocked pantry with no prepared foods but a full kitchen that will let you prepare it if you have the time and the know-how. To run a quick analysis, it is tough to beat the simplicity of lavaan, especially given its wide range of capabilities. For large complex models, OpenMx may be a better choice. The methods covered here are useful to analyze statistical relationships when one has all of the data from events that have already occurred. Resources for Article: Further resources on this subject: Creating your first heat map in R [article] Going Viral [article] Introduction to S4 Classes [article]


9 recommended blockchain online courses

Guest Contributor
27 Sep 2018
7 min read
Blockchain is reshaping the world as we know it. And we are not talking metaphorically because the new technology is really influencing everything from online security and data management to governance and smart contracting. Statistical reports support these claims. According to the study, the blockchain universe grows by over 40% annually, while almost 70% of banks are already experimenting with this technology. IT experts at the Editing AussieWritings.com Services claim that the potential in this field is almost limitless: “Blockchain offers a myriad of practical possibilities, so you definitely want to get acquainted with it more thoroughly.” Developers who are curious about blockchain can turn it into a lucrative career opportunity since it gives them the chance to master the art of cryptography, hierarchical distribution, growth metrics, transparent management, and many more. There were 5,743 mostly full-time job openings calling for blockchain skills in the last 12 months - representing the 320% increase - while the biggest freelancing website Upwork reported more than 6,000% year-over-year growth. In this post, we will recommend our 9 best blockchain online courses. Let’s take a look! Udemy Udemy offers users one of the most comprehensive blockchain learning sources. The target audience is people who have heard a little bit about the latest developments in this field, but want to understand more. This online course can help you to fully understand how the blockchain works, as well as get to grips with all that surrounds it. Udemy breaks down the course into several less complicated units, allowing you to figure out this complex system rather easily. It costs $19.99, but you can probably get it with a 40% discount. The one downside, however, is that content quality in terms of subject scope can vary depending on the instructor, but user reviews are a good way to gauge quality. Each tutorial lasts approximately 30 minutes, but it also depends on your own tempo and style of work. Pluralsight Pluralsight is an excellent beginner-level blockchain course. It comes in three versions: Blockchain Fundamentals, Surveying Blockchain Technologies for Enterprise, and Introduction to Bitcoin and Decentralized Technology. Course duration varies from 80 to 200 minutes depending on the package. The price of Pluralsight is $29 a month or $299 a year. Choosing one of these options, you are granted access to the entire library of documents, including course discussions, learning paths, channels, skill assessments, and other similar tools. Packt Publishing Packt Publishing has a wide portfolio of learning products on Blockchain for varying levels of experience in the field from beginners to experts. And what’s even more interesting is that you can choose your learning format from books, ebooks to videos, courses and live courses. Or you could simply subscribe to MAPT, their library to gain access to all products at a reasonable price of $29 monthly and $150 annually.  It offers several books and videos on the leading blockchain technology. You can purchase 5 blockchain titles at a discounted rate of $50. Here’s the list of top blockchain courses offered by Packt Publishing: Exploring Blockchain and Crypto-currencies: You will gain the foundational understanding of blockchain and crypto-currencies through various use-cases. Building Blockchain Projects: In this, you will be able to develop real-time practical DApps with Ethereum and JavaScript. 
Mastering Blockchain - Second Edition: You will learn about cryptography and cryptocurrencies so that you can build highly secure, decentralized applications and conduct trusted in-app transactions.
Hands-On Blockchain with Hyperledger: This book will help you leverage the power of Hyperledger Fabric to develop blockchain-based distributed ledgers with ease.
Learning Blockchain Application Development [video]: This interactive video will help you learn to build smart contracts and DApps on Ethereum.
Create Ethereum and Blockchain Applications using Solidity [video]: This video will help you learn about Ethereum, Solidity, DAO, ICO, Bitcoin, Altcoin, website security, Ripple, Litecoin, smart contracts, and apps.

Cryptozombies
Cryptozombies is an online blockchain course built around gamification. The tool teaches you to write smart contracts in Solidity by building your own crypto-collectibles game. It is entirely Ethereum-focused, but you don't need any previous experience to understand how Solidity works. A step-by-step guide explains even the smallest details, so you can quickly learn to create your own fully functional blockchain-based game. The best thing about Cryptozombies is that you can test it for free and give up in case you don't like it.

Coursera
The blockchain is the epicenter of the cryptocurrency world, so it's necessary to study it if you want to deal with Bitcoin and other digital currencies. Coursera is the leading online resource in the field of virtual currencies, so you might want to check it out. After a course like Blockchain Specialization, you'll know everything you need to separate fact from fiction when reading claims about Bitcoin and other cryptocurrencies. You'll have the conceptual foundations you need to engineer secure software that interacts with the Bitcoin network, and you'll be able to integrate ideas from Bitcoin into your own projects. It is a 4-part course spanning 4 weeks, but you can take each part separately. The price depends on the level and features you choose.

LinkedIn Learning (formerly known as Lynda)
LinkedIn Learning doesn't offer a specific blockchain course, but it does have a wide range of industry-related learning sources. A search for 'blockchain' will present you with almost 100 relevant video courses. You can find all sorts of lessons here, from beginner to expert levels, and you can filter the selection by video duration, author, software, subject, and so on. You can access the library for $15 a month.

B9Lab
The B9Lab ETH-25 Certified Online Ethereum Developer Course is another blockchain course aimed at the Ethereum platform. It's a 12-week, in-depth learning solution that targets experienced programmers. B9Lab introduces everything there is to know about blockchain and how to build useful applications. Participants are taught about the Ethereum platform, the programming language Solidity, how to use web3 and the Truffle framework, and how to tie everything together. The price is €1450, or about $1700.

IBM
IBM offers a self-paced blockchain course titled Blockchain Essentials that lasts over two hours. The video lectures and lab in this course help you learn about blockchain for business and explore key use cases that demonstrate how the technology adds value. You can learn how to leverage blockchain benefits, transform your business with the new technology, and transfer assets.
Besides that, you get a nice wrap-up and a quiz to test your knowledge upon completion. IBM's course is free of charge.

Khan Academy
Khan Academy is the last, but certainly not the least important, online course on our list. It gives users a comprehensive overview of blockchain-powered systems, particularly Bitcoin. Using this platform, you can learn more about cryptocurrency transactions, security, proof of work, and more. As an online education platform, Khan Academy won't cost you a dime.

Blockchain is a groundbreaking technology that opens new boundaries in almost every field of business. It directly influences financial markets, data management, digital security, and a variety of other industries. In this post, we presented the 9 best blockchain online courses you should try. These sources can teach you everything there is to know about blockchain basics. Take some time to check them out and you won't regret it!

Author Bio: Olivia is a passionate blogger who writes on topics of digital marketing, career, and self-development. She constantly tries to learn something new and to share this experience on various websites. Connect with her on Facebook and Twitter.

Google introduces Machine Learning courses for AI beginners
Microsoft starts AI School to teach Machine Learning and Artificial Intelligence.

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts

Packt
19 Aug 2011
3 min read
Oracle E-Business Suite 12 Financials Cookbook: take the hard work out of your daily interactions with E-Business Suite Financials by using the 50+ recipes from this cookbook.

Introduction
The liquidity of an organization is managed in Oracle Cash Management; this includes the reconciliation of the cashbook to the bank statements, and forecasting future cash requirements. In this article, we will look at how to create bank accounts and cash forecasts. Cash Management integrates with Payables, Receivables, Payroll, Treasury, and General Ledger. Let's start by looking at the cash management process:
1. The bank generates statements.
2. The statements are sent to the organization electronically or by post.
3. The Treasury Administrator loads and verifies the bank statement into Cash Management. The statements can also be manually entered into Cash Management.
4. The loaded statements are reconciled to the cashbook transactions.
5. The results are reviewed, and amended if required.
6. The Treasury Administrator creates the journals for the transactions in the General Ledger.

Creating bank accounts
Oracle Cash Management provides us with the functionality to create bank accounts. In this recipe, we will create a bank account for a bank called Shepherds Bank, for one of its branches, the Kings Cross branch.

Getting ready
Log in to Oracle E-Business Suite R12 with the username and password assigned to you by the system administrator. If you are working on the Vision demonstration database, you can use OPERATIONS/WELCOME as the USERNAME/PASSWORD. We also need to create a bank before we can create the bank account. Let's look at how to create the bank and the branch:
1. Select the Cash Management responsibility.
2. Navigate to Setup | Banks | Banks.
3. In the Banks tab, click on the Create button.
4. Select the Create new bank option.
5. In the Country field, enter United States.
6. In the Bank Name field, enter Shepherds Bank.
7. In the Bank Number field, enter JN316.
8. Click on the Finish button.

Let's create the branch and the address:
1. Click the Create Branch icon. The Country and the Bank Name are automatically entered.
2. Click on the Continue button.
3. In the Branch Name field, enter Kings Cross.
4. Select ABA as the Branch Type.
5. Click on the Save and Next button to create the branch address.
6. In the Branch Address form, click on the Create button.
7. In the Country field, enter United States.
8. In the Address Line 1 field, enter 4234 Red Eagle Road.
9. In the City field, enter Sacred Heart.
10. In the County field, enter Renville.
11. In the State field, enter MN.
12. In the Postal Code field, enter 56285.
13. Ensure that the Status field is Active.
14. Click on the Apply button.
15. Click on the Finish button.


Visualizations Using CCC

Packt
20 May 2016
28 min read
In this article by Miguel Gaspar, the author of the book Learning Pentaho CTools, you will learn about the CCC charts library in detail. CCC is not really a Pentaho plugin, but a chart library that Webdetails created some years ago and that Pentaho started to use for the Analyzer visualizations. It allows a great level of customization by changing the properties that are applied to the charts, and it integrates perfectly with CDF, CDE, and CDA.

(For more resources related to this topic, see here.)

The dashboards that Webdetails creates make use of the CCC charts, usually with a great level of customization. Customizing them is a way to make them fancy and really good-looking and, even more importantly, a way to create a visualization that best fits the customer's or end user's needs. We really should be focused on having the best visualizations for the end user, and CCC is one of the best ways to achieve this; but to do this you need to have a very deep knowledge of the library and know how to get amazing results. I think I could write an entire book just about CCC, and in this article I will only be able to cover a small part of what I like, but I will try to focus on the basics and give you some tips and tricks that could make a difference. I'll be happy if I can give you some directions that you can follow, and then you can keep searching and learning about CCC.

An important part of CCC is understanding some properties, such as the series in rows or the crosstab mode, because that is where people usually struggle at the start. When you can't find a property to change some styling, functionality, or behavior of the charts, you might find a way to extend the options by using something called extension points, so we will also cover them. I also find the interaction within the dashboard to be an important feature, so we will look at how to use it, and you will see that it's very simple.

In this article, you will learn how to:
Understand the properties needed to adapt the chart to your data source results
Use the properties of a CCC chart
Create a CCC chart by using the JavaScript library
Make use of internationalization of CCC charts
See how to handle clicks on charts
Scale the base axis
Customize the tooltips

Some background on CCC
CCC is built on top of Protovis, a JavaScript library that allows you to produce visualizations based on simple marks such as bars, dots, and lines, among others, which are created through dynamic properties based on the data to be represented. You can get more information on this at: http://mbostock.github.io/protovis/. If you want to extend the charts with elements that are not available, you can, but it is useful to have an idea of how Protovis works. CCC has a great website, available at http://www.webdetails.pt/ctools/ccc/, where you can see some samples including the source code. On the page, you can edit the code, change some properties, and click the apply button. If the code is valid, you will see your chart update. As well as that, it provides documentation for almost all of the properties and options that CCC makes available.

Making use of the CCC library in a CDF dashboard
As CCC is a chart library, you can use it as you would on any other webpage, just like the samples on the CCC website. But CDF also provides components that you can use to place a CCC chart on a dashboard and fully integrate it with the life cycle of the dashboard.
To use a CCC chart on a CDF dashboard, the HTML that is invoked from the XCDF file would look like the following (as we have already covered how to build a CDF dashboard, I will not focus on that, and will mainly focus on the JavaScript code):

<div class="row">
  <div class="col-xs-12">
    <div id="chart"></div>
  </div>
</div>
<script language="javascript" type="text/javascript">
  require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccBarChartComponent'],
    function(Dashboard, CccBarChartComponent) {
      var dashboard = new Dashboard();
      var chart = new CccBarChartComponent({
          type: "cccBarChart",
          name: "cccChart",
          executeAtStart: true,
          htmlObject: "chart",
          chartDefinition: {
              height: 200,
              path: "/public/…/queries.cda",
              dataAccessId: "totalSalesQuery",
              crosstabMode: true,
              seriesInRows: false,
              timeSeries: false,
              plotFrameVisible: false,
              compatVersion: 2
          }
      });
      dashboard.addComponent(chart);
      dashboard.init();
  });
</script>

The most important thing here is the CCC chart component being used; in this example it's a bar chart. We can see that from the object we are instantiating, CccBarChartComponent, and also from the type, which is cccBarChart. The previous dashboard will execute the query specified as the dataAccessId of the CDA file set in the path property, and render the chart on the dashboard. We are also saying that its data comes from the query in crosstab mode, but that the base axis should not be a time series. There are series in the columns, but don't worry about this, as we'll be covering it later.

The existing CCC components that you are able to use out of the box inside CDF dashboards are listed below. Don't forget that CCC has plenty of charts, so the samples linked below are just one example of the type of charts you can achieve.

CccAreaChartComponent: cccAreaChart
CccBarChartComponent: cccBarChart - http://www.webdetails.pt/ctools/ccc/#type=bar
CccBoxplotChartComponent: cccBoxplotChart - http://www.webdetails.pt/ctools/ccc/#type=boxplot
CccBulletChartComponent: cccBulletChart - http://www.webdetails.pt/ctools/ccc/#type=bullet
CccDotChartComponent: cccDotChart - http://www.webdetails.pt/ctools/ccc/#type=dot
CccHeatGridChartComponent: cccHeatGridChart - http://www.webdetails.pt/ctools/ccc/#type=heatgrid
CccLineChartComponent: cccLineChart - http://www.webdetails.pt/ctools/ccc/#type=line
CccMetricDotChartComponent: cccMetricDotChart - http://www.webdetails.pt/ctools/ccc/#type=metricdot
CccMetricLineChartComponent: cccMetricLineChart
CccNormalizedBarChartComponent: cccNormalizedBarChart
CccParCoordChartComponent: cccParCoordChart
CccPieChartComponent: cccPieChart - http://www.webdetails.pt/ctools/ccc/#type=pie
CccStackedAreaChartComponent: cccStackedAreaChart - http://www.webdetails.pt/ctools/ccc/#type=stackedarea
CccStackedDotChartComponent: cccStackedDotChart
CccStackedLineChartComponent: cccStackedLineChart - http://www.webdetails.pt/ctools/ccc/#type=stackedline
CccSunburstChartComponent: cccSunburstChart - http://www.webdetails.pt/ctools/ccc/#type=sunburst
CccTreemapAreaChartComponent: cccTreemapAreaChart - http://www.webdetails.pt/ctools/ccc/#type=treemap
CccWaterfallAreaChartComponent: cccWaterfallAreaChart - http://www.webdetails.pt/ctools/ccc/#type=waterfall

In the sample code, you will also find a property called compatVersion with a value of 2 set.
This will make CCC work as a revamped version that delivers more options and a lot of improvements, and is easier to use.

Mandatory and desirable properties
Besides properties such as name, datasource, and htmlObject, there are other chart properties that are mandatory. The height is really important, because if you don't set the height of the chart, it will not fit in the dashboard. The height should be specified in pixels. If you don't set the width of the component, the chart will grab the width of the element where it's being rendered; to be more precise, it will grab the width of the HTML element whose name is specified in the htmlObject property. The seriesInRows, crosstabMode, and timeSeries properties are optional, but depending on the kind of chart you are generating, you might want to specify them. The use of these properties becomes clear once we can also see the output of the queries we are executing. We need to get deeper into the properties that are related to mapping data to visual elements.

Mapping data
We need to be aware of the way that data mapping is done in the chart. You can understand how it works if you imagine the data input as a table. CCC can receive the data in two different structures: relational and crosstab. If CCC receives data as a crosstab, it will translate it to a relational structure. You can see this in the following examples.

Crosstab
The following table is an example of the crosstab data structure:

             Column Data 1     Column Data 2
Row Data 1   Measure Data 1.1  Measure Data 1.2
Row Data 2   Measure Data 2.1  Measure Data 2.2

Creating crosstab queries
To create a crosstab query, you can usually do this with grouping when using SQL, or just use MDX, which allows us to easily specify a set for the columns and one for the rows.

Just by looking at the previous and following examples, you should be able to understand that in the crosstab structure (the previous), columns and rows are part of the result set, while in the relational format (the following), column headers are not part of the result set, but are part of the metadata that is returned from the query. The relational format is as follows:

Column         Row         Value
Column Data 1  Row Data 1  Measure Data 1.1
Column Data 2  Row Data 1  Measure Data 1.2
Column Data 1  Row Data 2  Measure Data 2.1
Column Data 2  Row Data 2  Measure Data 2.2

The preceding two data structures represent the options when setting the crosstabMode and seriesInRows properties.

The crosstabMode property
To better understand these concepts, we will make use of a real example. The crosstabMode property is easy to understand when comparing two tables that represent the results of two queries.

Non-crosstab (relational):

Markets  Sales
APAC     1281705
EMEA     50028224
Japan    503957
NA       3852061

Crosstab:

Markets  2003   2004   2005
APAC     3529   5938   3411
EMEA     16711  23630  9237
Japan    2851   1692   380
NA       13348  18157  6447

In the first table, you can find the sales values for each of the territories. The values presented depend on only one variable: the territory. We can say that we are able to get all the information just by looking at the rows, where we can see a direct connection between markets and the sales value.
In the second table, you will find a value for each territory/year combination, meaning that the values presented depend on two variables: the territory in the rows and the year in the columns. Here we need both the rows and the columns to know what each of the values represents. Relevant information can be found in both the rows and the columns, so this is a crosstab. Crosstabs display the joint distribution of two or more variables, and are usually represented in the form of a contingency table in a matrix. When the result of a query depends on only one variable, you should set the crosstabMode property to false. When it depends on two or more variables, you should set the crosstabMode property to true; otherwise CCC will just use the first two columns, as in the non-crosstab example.

The seriesInRows property
Now let's use the same example where we have a crosstab. The previous image shows two charts: the one on the left is a crosstab with the series in the rows, and the one on the right is also a crosstab but the series are not in the rows (the series are in the columns). When crosstabMode is set to true, the measure column title can be translated as a series or a category, and that's determined by the seriesInRows property. If this property is set to true, the chart will read the series from the rows; otherwise it will read the series from the columns. If crosstabMode is set to false, the chart component expects a row to correspond to exactly one data point, and two or three columns can be returned. When three columns are returned, they can be category, series, and data or series, category, and data, and that's determined by the seriesInRows property: when set to true, CCC expects the structure to have the three columns as category, series, and data; when set to false, it expects them to be series, category, and data.

A simple table should give you a quicker reference, so here goes:

crosstabMode  seriesInRows  Description
true          true          The column titles act as category values, while the series values come from the first column.
true          false         The column titles act as series values, while the category values come from the first column.
false         true          The columns are expected in the order category, series, and data.
false         false         The columns are expected in the order series, category, and data.

The timeSeries and timeSeriesFormat properties
The timeSeries property defines whether the data to be represented by the chart is discrete or continuous. If we want to present some values over time, the timeSeries property should be set to true. When we set the chart to be a time series, we also need to set another property to tell CCC how it should interpret the dates that come from the query. Check out the following image for timeSeries and timeSeriesFormat. The result of one of the queries has the year and the abbreviated month name separated by -, like 2015-Nov. For the chart to understand it as a date, we need to specify the format by setting the timeSeriesFormat property, which in our example would be %Y-%b, where %Y is the year represented by four digits and %b is the abbreviated month name. The format should be specified using the Protovis format, which follows the same format as strftime in the C programming language, aside from some unsupported options. To find out which options are available, take a look at the documentation at: https://mbostock.github.io/protovis/jsdoc/symbols/pv.Format.date.html.
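As a quick illustration, here is a minimal sketch of how those data-mapping options could be wired into a CDF component for that month-name example, mirroring the earlier CccBarChartComponent sample. The component name, HTML object, and query name are placeholders of my own (not values from the original example), it assumes the line chart component follows the same module path pattern as the bar chart, and it assumes a non-crosstab result set with one row per month:

require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccLineChartComponent'],
  function(Dashboard, CccLineChartComponent) {
    var dashboard = new Dashboard();
    // Placeholder names: salesOverTime, chartOverTime, salesByMonthQuery.
    dashboard.addComponent(new CccLineChartComponent({
        type: "cccLineChart",
        name: "salesOverTime",
        executeAtStart: true,
        htmlObject: "chartOverTime",
        chartDefinition: {
            height: 200,
            path: "/public/…/queries.cda",      // placeholder CDA file
            dataAccessId: "salesByMonthQuery",  // placeholder query name
            crosstabMode: false,        // each row is one data point
            seriesInRows: false,        // with three columns: series, category, data
            timeSeries: true,           // continuous (date) base axis
            timeSeriesFormat: "%Y-%b",  // parses labels such as "2015-Nov"
            compatVersion: 2
        }
    }));
    dashboard.init();
});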
Making use of CCC in CDE
There are a lot of properties that will use a default value, and you can find out about them by looking at the documentation or by inspecting the code that is generated by CDE when you use the chart components. By looking at the console log of your browser, you should also be able to get some information about the properties being used by default and/or see whether you are using a property that does not fit your needs. The use of CCC charts in CDE is simpler, just because you may not need to code. I am only saying may because, to achieve quicker results, you may apply some code and make it easier to share properties among different charts or types of charts. To use a CCC chart, you just need to select the property that you want to change and set its value by using the dropdown or by typing the value.

The previous image shows a group of properties with the respective values on the right side. One of the best ways to get used to the CCC properties is to use the CCC page available as part of the Webdetails site: http://www.webdetails.pt/ctools/ccc. There you will find samples and the properties that are being used for each of the charts. You can use the dropdown to select different kinds of charts from all those that are available inside CCC. You also have the ability to change the properties and update the chart to check the result immediately. What I usually do, as it's easier and faster, is to change the properties there, check the results, and then apply the necessary values for each of the properties in the CCC charts inside the dashboards. In the samples, you will also find documentation about the properties, grouped by sections of the chart, and after that you will find the extension points. On the site, when you click on a property/option, you will be redirected to another page where you will find the documentation and how to use it.

Changing properties in the preExecution or postFetch
We are able to change the properties of the charts, as with any other component. Inside the preExecution, this refers to the component itself, so we will have access to the chart's main object, which we can manipulate to add, remove, and change options. For instance, you can apply the following code:

function() {
    var cdProps = {
        dotsVisible: true,
        plotFrame_strokeStyle: '#bbbbbb',
        colors: ['#005CA7', '#FFC20F', '#333333', '#68AC2D']
    };
    $.extend(true, this.chartDefinition, cdProps);
}

What we are doing is creating an object with all the properties that we want to add or change for the chart, and then extending the chartDefinition (where the properties or options are). That is what the jQuery extend function is doing.

Use the CCC website and make your life easier
This way of applying options makes it easier to set the properties. Just change or add the properties that you need, test it, and when you're happy with the result, copy them into the object that will extend/overwrite the chart options.

Just keep in mind that the properties you change directly in the editor will be overwritten by the ones defined in the preExecution, if they match each other, of course. Why is this important? Because not all the properties that you can apply to CCC are exposed in CDE, so you can use the preExecution to set those properties.
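The section above is titled "Changing properties in the preExecution or postFetch", but only preExecution was shown. Here is a minimal, hedged sketch of what a postFetch handler could look like; it assumes the standard CDF postFetch behavior (the function receives the CDA result object before the chart renders and must return it), and the column index used below is purely illustrative:

function(data) {
    // Assumed CDA result shape: data.resultset is an array of rows and
    // data.metadata describes the columns.
    // Illustrative tweak: drop rows whose measure (assumed here to be the
    // third column) is not positive, before the chart is rendered.
    data.resultset = data.resultset.filter(function(row) {
        return row[2] > 0;
    });
    return data;
}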
Handling the click event
One important thing about charts is that they allow interaction. CCC provides a way to handle some events in the chart, and click is one of those events. To make it work, we need to change two properties: clickable, which needs to be set to true, and clickAction, where we need to write a function with the code to be executed when a click happens. The function receives one argument that is usually referred to as a scene. The scene is an object that has a lot of information about the context where the event happened. From this object you will have access to vars, another object where we can find the series and the categories where the click happened. We can use the function to get the series/categories being clicked and perform a fireChange that can trigger updates on other components:

function(scene) {
    var series = "Series:" + scene.atoms.series.label;
    var category = "Category:" + scene.vars.category.label;
    var value = "Value:" + scene.vars.value.label;
    Logger.log(category + "&" + value);
    Logger.log(series);
}

In the previous code example, you can find the function that handles the click action for a CCC chart. When the click happens, the code is executed: the clicked series is taken from scene.atoms.series.label, the clicked category from scene.vars.category.label, and the value that crosses that series/category from scene.vars.value.value. This is valid for a crosstab, but you will not find the series when it's a non-crosstab.

You can think of a scene as describing one instance of visual representation. It is generally local to each panel or section of the chart, and it's represented by a group of variables that are organized hierarchically. Depending on the scene, it may contain one or many datums. And you must be asking: what exactly is a datum? A datum represents a row, so it contains values for multiple columns. We can also see from the example that we are referring to atoms, which hold at least a value, a label, and a key of a column. To get a better understanding of what I am talking about, set a breakpoint anywhere in the code of the previous function and explore the scene object. In the previous example, you would be able to access the category and series labels and the value, as follows:

Value: scene.vars.value.label or scene.getValue() (crosstab and non-crosstab)
Category: scene.vars.category.label or scene.getCategoryLabel() (crosstab and non-crosstab)
Series: scene.atoms.series.label or scene.getSeriesLabel() (crosstab only)

For instance, if you add the previous function code to a chart that is a crosstab where the categories are the years and the series are the territories, and you click on the chart, the output would be something like:

[info] WD: Category:2004 & Value:23630
[info] WD: Series:EMEA

This means that you clicked on the year 2004 for EMEA. EMEA sales for the year 2004 were 23,630.
If you replace the Logger functions with a fireChange as follows, you will be able to make use of the label/value of the clicked category to render other components and show some details about it:

this.dashboard.fireChange("parameter", scene.vars.category.label);

Internationalization of CCC charts
We already saw that the values coming from the database should not need to be translated. There are some ways in Pentaho to do this, but we may still need to set the title of a chart, and the title should also be internationalized. Another case is when you have dates where the month is represented by numbers in the base axis, but you want to display the month's abbreviated name. This name can also be translated to different languages, which is not hard.

For the title, sub-title, and legend, the way to do it is to use the instructions on how to set properties in the preExecution. First, you will need to define the properties files for the internationalization and set the properties/translations:

var cd = this.chartDefinition;
cd.title = this.dashboard.i18nSupport.prop('BOTTOMCHART.TITLE');

To change the title of the chart based on the language defined, we need to use a function; we can't use the title property in the chart editor because that only allows you to define a string, so you would not be able to use a JavaScript instruction to get the text. If you set the previous example code in the preExecution of the chart, then you will be able to.

It may also make sense to change not only the titles but also, for instance, to internationalize the month names. If you are getting data like 2004-02, this may correspond to a time series format of %Y-%m. If that's the case and you want to display the abbreviated month name, then you may use the baseAxisTickFormatter and the dateFormat function from the dashboard utilities, also known as Utils. The code to write inside the preExecution would be like:

var cd = this.chartDefinition;
cd.baseAxisTickFormatter = function(label) {
  // Parse a "2004-02" style label and return the abbreviated month name and year.
  return Utils.dateFormat(moment(label, 'YYYY-MM'), 'MMM/YYYY');
};

The preceding code uses the baseAxisTickFormatter, which allows you to write a function that receives an argument, identified in the code as label, because it stores the label of each one of the base axis ticks. We are using the dateFormat method and moment to format and return the abbreviated month name followed by the year. You can get information about the language being used by running the instruction moment.locale(); if you need to, you can change the language.

Format a base axis label based on the scale
When you are working with a time series chart, you may want to set a different format for the base axis labels. Let's suppose you have a chart that is listening to a time selector. If you select one year of data to be displayed on the chart, you are certainly not interested in seeing the minutes on the date label. However, if you want to display the last hour, the ticks of the base axis need to be presented in minutes. There is an extension point we can use to get a conditional format based on the scale of the base axis.
The extension point is baseAxisScale_tickFormatter, and it can be used as in the following code:

baseAxisScale_tickFormatter: function(value, dateTickPrecision) {
  switch(dateTickPrecision) {
    case pvc.time.intervals.y:
      return format_date_year_tick(value);
    case pvc.time.intervals.m:
      return format_date_month_tick(value);
    case pvc.time.intervals.H:
      return format_date_hour_tick(value);
    default:
      return format_date_default_tick(value);
  }
}

It accepts a function with two arguments, the value to be formatted and the tick precision, and should return the formatted label to be presented for each tick of the base axis. The previous code shows how the function is used: you can see a switch that, based on the base axis scale, applies a different format by calling a function. The functions used in the code are not pre-defined; we need to write the functions or code that creates the formatting. For example, such a function could use the Utils dateFormat function to return the formatted value to the chart. The following table shows the intervals that can be used when verifying which time interval is being displayed on the chart:

Interval  Description   Number representing the interval
y         Year          31536e6
m         Month         2592e6
d30       30 days       2592e6
d7        7 days        6048e5
d         Day           864e5
H         Hour          36e5
m         Minute        6e4
s         Second        1e3
ms        Milliseconds  1

Customizing tooltips
CCC provides the ability to change the default tooltip format using the tooltipFormat property. We can change it to look like the image on the right side below; you can compare it to the one on the left, which is the default one.

The default tooltip format might change depending on the chart type, but also on some options that you apply to the chart, mainly crosstabMode and seriesInRows. The property accepts a function that receives one argument, the scene, which has a structure similar to the one already covered for the click event. You should return the HTML to be shown on the dashboard when we hover over the chart. In the previous image, the chart on the left side shows the default tooltip, and the one on the right shows a different tooltip. That's because the following code was applied:

tooltipFormat: function(scene) {
  var year = scene.atoms.series.label;
  var territory = scene.atoms.category.value;
  var sales = Utils.numberFormat(scene.vars.value.value, "#.00A");
  var html = '<html>' +
    '<div>Sales for ' + year + ' at ' + territory + ': ' + sales + '</div>' +
  '</html>';
  return html;
}

The code is pretty self-explanatory. First we set some variables, such as the year, territory, and sales value, which we need to present inside the tooltip. As in the click event, we get the labels/values from the scene, which might depend on the properties we set for the chart. For the sales value, we also abbreviate it, using two decimal places. And last, we build the HTML to be displayed when we hover over the chart.

You can also change the base axis tooltip
Like the tooltip shown when hovering over the values represented in the chart, we can also customize the baseAxisTooltip; just don't forget that baseAxisTooltipVisible must be set to true (the default value). Getting the values to show will be pretty similar.

It can get a bit more complex, though not much more, when we also want, for instance, to display the total value of sales for one year or for a territory. Based on that, we could also present the percentage relative to the total.
We should use the property as explained earlier. The previous image is one example of how we can customize a tooltip. In this case, we are showing the value, but also the percentage that the hovered-over territory represents (as a percentage across all years) and the percentage for the hovered-over year (as a percentage across all territories):

tooltipFormat: function(scene) {
  var year = scene.getSeriesLabel();
  var territory = scene.getCategoryLabel();
  var value = scene.getValue();
  var sales = Utils.numberFormat(value, "#.00A");
  var totals = {};
  _.each(scene.chart().data._datums, function(element) {
    var value = element.atoms.value.value;
    totals[element.atoms.category.label] =
      (totals[element.atoms.category.label] || 0) + value;
    totals[element.atoms.series.label] =
      (totals[element.atoms.series.label] || 0) + value;
  });
  var categoryPerc = Utils.numberFormat(value/totals[territory], "0.0%");
  var seriesPerc = Utils.numberFormat(value/totals[year], "0.0%");
  var html = '<html>' +
    '<div class="value">' + sales + '</div>' +
    '<div class="dValue">Sales for ' + territory + ' in ' + year + '</div>' +
    '<div class="bar">' +
      '<div class="pPerc">' + categoryPerc + ' of ' + territory + '</div>' +
      '<div class="partialBar" style="width:' + categoryPerc + '"></div>' +
    '</div>' +
    '<div class="bar">' +
      '<div class="pPerc">' + seriesPerc + ' of ' + year + '</div>' +
      '<div class="partialBar" style="width:' + seriesPerc + '"></div>' +
    '</div>' +
  '</html>';
  return html;
}

The first lines of the code are pretty similar, except that we are using scene.getSeriesLabel() in place of scene.atoms.series.label. They do the same thing; they are just different ways to get the values/labels. Then come the total calculations, computed by iterating over all the elements of scene.chart().data._datums, which returns the logical/relational table: a combination of territory, year, and value. The last part just builds the HTML with all the values and labels that we already got from the scene. There are multiple ways to get the values you need; to customize the tooltip, you just need to explore the hierarchical structure of the scene and get used to it. The image also presents a different style, and that is done using CSS. You can add CSS to your dashboard and change the style of the tooltip, not just the format.

Styling tooltips
When we want to style a tooltip, we may want to use the developer tools to check the classes or names and the CSS properties already applied, but it's hard because the popup does not stay still. We can change the tooltipDelayOut property and increase its default value from 80 to 1000 or more, depending on the time you need. When you want to apply some styles to the tooltips of a particular chart, you can do so by setting a CSS class on the tooltip. For that you should use the tooltipClassName property and set the class name to be added and later used in the CSS.

Summary
In this article, we provided a quick overview of how to use CCC in CDF and CDE dashboards and showed you what kinds of charts are available. We covered some of the base options as well as some advanced options that you might use to get a more customized visualization.

Resources for Article:
Further resources on this subject:
Diving into OOP Principles [article]
Python Scripting Essentials [article]
Building a Puppet Module Skeleton [article]


Scraping the Data

Packt
21 Sep 2015
18 min read
In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup, and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives.

(For more resources related to this topic, see here.)

Analyzing a web page
To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option. The data we are interested in is found in this part of the HTML:

<table>
<tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag: </label></td><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr>
…
<tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr>
</table>

This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version. Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu. This opens a panel showing the surrounding HTML hierarchy of the selected element. In the preceding screenshot, the country attribute was clicked on, and the Firebug panel makes it clear that the country area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data.

Three approaches to scrape a web page
Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.

Regular expressions
If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html.
To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> url = 'http://example.webscraping.com/view/United Kingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area, we can select the second element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider what happens if the website is updated and this data is no longer available in the second table row. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in the future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated that would still break the regular expression. For example, double quotation marks might be changed to single ones, extra space could be added between the <td> tags, or the area label could be changed. Here is an improved version that tries to support these various possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
'244,820 square kilometres'

This regular expression is more future-proof, but it is difficult to construct and is becoming unreadable. Also, there are still other minor layout changes that would break it, such as if a title attribute were added to the <td> tag. From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup
Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate its content. If you do not already have it installed, the latest version can be installed using this command:

pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML, and Beautiful Soup needs to decide what is intended. For example, consider this simple web page containing a list with missing attribute quotes and closing tags:

<ul class=country>
<li>Area
<li>Population
</ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping.
Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print fixed_html
<html>
 <body>
  <ul class="country">
   <li>Area</li>
   <li>Population</li>
  </ul>
 </body>
</html>

Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'})
>>> ul.find('li') # returns just the first match
<li>Area</li>
>>> ul.find_all('li') # returns all matches
[<li>Area</li>, <li>Population</li>]

Beautiful Soup overview
Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup:
BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter.
find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes.
find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None.
prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks.
For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Now, using these techniques, here is a full example to extract the area from our example country page:

>>> from bs4 import BeautifulSoup
>>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
>>> html = download(url)
>>> soup = BeautifulSoup(html)
>>> # locate the area row
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
>>> area = td.text # extract the text from this tag
>>> print area
244,820 square kilometres

This code is more verbose than the regular expression code but easier to construct and understand. Also, we no longer need to worry about problems with minor layout changes, such as extra whitespace or tag attributes.

Lxml
Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html. As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html) # parse the HTML
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup.
Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors. Here is an example using the lxml CSS selectors to extract the area data:

>>> tree = lxml.html.fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print area
244,820 square kilometres

The key line is the one with the CSS selector. It finds a table row element with the places_area__row ID, and then selects the child table data tag with the w2p_fw class.

CSS selectors
CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need:
Select any tag: *
Select by tag <a>: a
Select by class of "link": .link
Select by tag <a> with class "link": a.link
Select by tag <a> with ID "home": a#home
Select by child <span> of tag <a>: a > span
Select by descendant <span> of tag <a>: a span
Select by tag <a> with attribute title of "Home": a[title=Home]
The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors. Note that, internally, lxml converts the CSS selectors into equivalent XPath.

Comparing performance
To help evaluate the trade-offs of the three scraping approaches described in this article, it helps to compare their relative efficiency. Typically, a scraper extracts multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country's web page. To get started, we need to return to Firebug to check the format of the other country features. Firebug shows that each table row has an ID starting with places_ and ending with __row, and that the country data is contained within these rows in the same format as the earlier area example.
Here are implementations that use this information to extract all of the available country data:

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

import re
def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
    return results

from bs4 import BeautifulSoup
def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
    return results

import lxml.html
def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

Scraping results
Now that we have complete implementations for each scraper, we will test their relative performance with this snippet:

import time
NUM_ITERATIONS = 1000 # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)

This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the line calling re.purge(); by default, the regular expression module caches searches, and this cache needs to be cleared to make a fair comparison with the other scraping approaches. Here are the results from running this script on my computer:

$ python performance.py
Regular expressions: 5.50 seconds
BeautifulSoup: 42.84 seconds
Lxml: 7.06 seconds

The results on your computer will quite likely be different because of the different hardware used. However, the relative difference between the approaches should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated, because lxml and the regular expression module were written in C, while BeautifulSoup is pure Python. An interesting fact is that lxml performed comparatively well against regular expressions, since lxml has the additional overhead of having to parse the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is reduced and lxml becomes even more competitive. It really is an amazing module!

Overview
The following table summarizes the advantages and disadvantages of each approach to scraping:

Scraping approach    Performance  Ease of use  Ease to install
Regular expressions  Fast         Hard         Easy (built-in module)
Beautiful Soup       Slow         Easy         Easy (pure Python)
Lxml                 Fast         Easy         Moderately difficult

If the bottleneck of your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup.
Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice. However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches.

Adding a scrape callback to the link crawler
Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after a certain event (in this case, after a web page has been downloaded). This scrape callback will take a url and html as parameters and optionally return a list of further URLs to crawl. Here is the implementation, which is simple in Python:

def link_crawler(..., scrape_callback=None):
    …
    links = []
    if scrape_callback:
        links.extend(scrape_callback(url, html) or [])
    …

The new code for the scrape callback is shown in the preceding snippet. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback. Here is a modified version of the lxml example scraper that can be used as the callback function:

def scrape_callback(url, html):
    if re.search('/view/', url):
        tree = lxml.html.fromstring(html)
        row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS]
        print url, row

This callback function would scrape the country data and print it out. Usually, when scraping a website, we want to reuse the data, so we will extend this example to save the results to a CSV spreadsheet, as follows:

import csv

class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)

To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how the scrape callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names. This code shows how to pass this callback to the link crawler:

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback())

Now, when the crawler is run with this callback, it will save the results to a CSV file that can be viewed in an application such as Excel or LibreOffice. Success! We have completed our first working scraper.

Summary
In this article, we walked through a variety of ways to scrape data from a web page.
Summary

In this article, we walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and Beautiful Soup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.

Resources for Article:

Further resources on this subject:
Scientific Computing APIs for Python [article]
Bizarre Python [article]
Optimization in Python [article]

Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learners guide for python and scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. Additionally explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with  park 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: bigint, name: string]
scala> //Created a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string]
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string])
scala> // Vertices in the graph
scala> userGraph.vertices.show()
+---+------+
| id|  name|
+---+------+
|  1|Thomas|
|  2| Krish|
|  3|Mathew|
+---+------+
scala> // Edges in the graph
scala> userGraph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  1|  2|     Follows|
|  1|  2|         Son|
|  2|  3|     Follows|
+---+---+------------+
scala> //Number of edges in the graph
scala> val edgeCount = userGraph.edges.count()
edgeCount: Long = 3
scala> //Number of vertices in the graph
scala> val vertexCount = userGraph.vertices.count()
vertexCount: Long = 3
scala> //Number of edges coming to each of the vertex.
scala> userGraph.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
|  2|       2|
|  3|       1|
+---+--------+
scala> //Number of edges going out of each of the vertex.
scala> userGraph.outDegrees.show()
+---+---------+
| id|outDegree|
+---+---------+
|  1|        2|
|  2|        1|
+---+---------+
scala> //Total number of edges coming in and going out of each vertex.
scala> userGraph.degrees.show()
+---+------+
| id|degree|
+---+------+
|  1|     2|
|  2|     3|
|  3|     1|
+---+------+
scala> //Get the triplets of the graph
scala> userGraph.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> //Using the DataFrame API, apply filter and select only the needed edges
scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count()
numFollows: Long = 2
scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew")))
usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35
scala> //Created an RDD of Edge type with String type as the property of the edge
scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows")))
userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35
scala> //Create a graph containing the vertex and edge RDDs as created before
scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD)
userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614
scala> //Create the GraphFrame based graph from Spark GraphX based graph
scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD)
userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string])
scala> userGraphFrameFromGraphX.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> // Convert the GraphFrame based graph to a Spark GraphX based graph
scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX
userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql.Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2

When creating DataFrames for a GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but GraphFrames does not have any such limitation and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well.

The Python version of GraphFrames has fewer features. Since Python is not a supported programming language for the Spark GraphX library, GraphFrame-to-GraphX and GraphX-to-GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, at the time of writing there were some pending defects in the GraphFrames API for Python, and not all the features demonstrated previously in Scala worked properly in Python.
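Although the Python example is omitted in the excerpt, the shape of the Python API is easy to show. The following is my own minimal sketch, not code from the book; it assumes a Spark 1.6 pyspark shell started with the same --packages graphframes:graphframes:0.1.0-spark1.6 option, and, as noted above, not every feature of this early Python binding behaved exactly like the Scala API:

# Launch with:
#   ./bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
from graphframes import GraphFrame

# sqlContext is provided by the pyspark shell
users = sqlContext.createDataFrame(
    [(1, "Thomas"), (2, "Krish"), (3, "Mathew")], ["id", "name"])
userRelationships = sqlContext.createDataFrame(
    [(1, 2, "Follows"), (1, 2, "Son"), (2, 3, "Follows")],
    ["src", "dst", "relationship"])

userGraph = GraphFrame(users, userRelationships)
userGraph.vertices.show()       # the three users
userGraph.edges.show()          # the three relationships
userGraph.inDegrees.show()      # edges coming into each vertex
userGraph.outDegrees.show()     # edges going out of each vertex
userGraph.degrees.show()        # total edges touching each vertex
# Ordinary DataFrame operations apply, for example:
userGraph.edges.filter("relationship = 'Follows'").count()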
Understanding GraphFrames queries

The Spark GraphX library is an RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms but also graph queries. The major difference between the two is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in that data. In GraphFrame parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs.

From a use case perspective, take the example of users following each other in a social media application. Users have relationships between them, and in the previous sections these relationships were modeled as graphs. In real-world use cases such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, that need can be expressed as a pattern in a graph query, and such relationships can be found using simple programmatic constructs. The following demonstration models the relationships between users in a GraphFrame, and a pattern search is done using it.

At the Scala REPL prompt of Spark 1.6, try the following statements:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 145ms :: artifacts dl 2ms
   :: modules in use:
   graphframes#graphframes;0.1.0-spark1.6 from list in [default]
   ---------------------------------------------------------------------
   |                  |            modules            ||   artifacts   |
   |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
   ---------------------------------------------------------------------
   |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
   ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer.
scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: string, name: string]
scala> //Create a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string]
scala> //Create the GraphFrame
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string])
scala> // Search for pairs of users who are following each other
scala> // In other words, the query can be read like this: find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to user u1 using the edge e2. When a query is formed like this, the result will be listed with the columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variable names suitable for the use case can be used.
scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>]
scala> graphQuery.show()
+-------------+----------+----------+-------------+
|           e1|        u1|        u2|           e2|
+-------------+----------+----------+-------------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]|
|[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]|
+-------------+----------+----------+-------------+

Note that the columns in the graph query result are formed from the elements given in the search pattern, and there is no limit to the way the patterns can be formed. Note also the data type of the graph query result: it is a DataFrame object, which brings great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported in popular programming languages such as Python and R. Since GraphFrames is a DataFrame-based library, once it matures it will enable graph processing in all the programming languages supported by DataFrames. This Spark external package is definitely a potential candidate to be included as part of Spark itself.

To know more about the design and development of data processing applications using Spark and the family of libraries built on top of it, do check out the book Apache Spark 2 for Beginners.
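For comparison, here is a minimal sketch of the same mutual-followers motif query through the Python API. This is my own sketch rather than code from the book; it assumes the same pyspark 1.6 shell with the graphframes 0.1.0 package loaded, and it relies on the fact that the query result is a regular DataFrame that can be filtered with ordinary Spark SQL expressions:

from graphframes import GraphFrame

users = sqlContext.createDataFrame(
    [("1", "Thomas"), ("2", "Krish"), ("3", "Mathew")], ["id", "name"])
userRelationships = sqlContext.createDataFrame(
    [("1", "2", "Follows"), ("2", "1", "Follows"), ("2", "3", "Follows")],
    ["src", "dst", "relationship"])
userGraph = GraphFrame(users, userRelationships)

# Find pairs of users connected by edges in both directions
graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery.show()

# The result is a DataFrame, so Spark SQL style filtering applies,
# for example keeping only mutual 'Follows' relationships:
graphQuery.filter("e1.relationship = 'Follows' AND e2.relationship = 'Follows'").show()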

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function of the model parameters to the loss function:

    loss(w) = RSS(w) + λ J(w),  with λ > 0

The penalty function is completely independent from the training set {x, y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) w_d. For a model of D dimensions, the generic Lp-norm is defined as follows:

    ||w||_p = ( Σ_d |w_d|^p )^(1/p),  d = 1, …, D

Notation: Regularization applies to the parameters or weights associated with an observation. In order to be consistent with our notation, w0 being the intercept value, the regularization applies to the parameters w1 … wd.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning

The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS. The L1 regularization applied to linear regression is known as the Lasso regularization. Ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularization differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The main differences between the two regularizations are as follows:

Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse datasets, L2 has a smaller estimation error than L1.

Feature selection: L1 is more effective than L2 in reducing the regression weights for features with high values. Therefore, L1 is a reliable feature selection tool.

Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model), for the same reason that it is more appropriate for selecting features.

Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty, w², is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is a summation of |wi| and is therefore not differentiable.

Terminology: The ridge regression is sometimes called the penalized least squares regression.
The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term; its weights minimize the following:

    Σ_i (y_i − w0 − Σ_d w_d x_id)² + λ Σ_d w_d²

The computation of the ridge regression parameters requires the resolution of a system of linear equations similar to the linear regression. The matrix representation of the ridge regression closed form is as follows:

    w = (Xᵀ X + λ I)⁻¹ Xᵀ y

where I is the identity matrix; the regularized matrix Xᵀ X + λ I is solved using the QR decomposition, as shown in the implementation below.

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signatures as their ordinary least squares counterparts. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],
                                   val y: DblVector,
                                   val lambda: Double)
      extends AbstractMultipleLinearRegression
      with PipeOperator[Array[T], Double] {

   private var qr: QRDecomposition = null
   private[this] val model: Option[RegressionModel] = …
   …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class trains the model. The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1).
2. Compute the weights using calculateBeta defined in the base class (line 2).
3. Return the tuple of regression weights calculateBeta and the residuals calculateResiduals.

private val model: Option[(DblVector, Double)] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix scaled by the lambda factor has to be added to the diagonal of the matrix to be decomposed (line 4):

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)   //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i => xtx.setEntry(i, i, xtx.getEntry(i,i) + lambda)) //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class:

override protected def calculateBeta: RealVector = qr.getSolver().solve(getY())
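Independently of the Scala implementation above, the closed form is easy to sanity-check numerically. The following is a small NumPy sketch of my own, not code from the book; unlike the book's implementation it regularizes every weight (there is no separate intercept w0), and it is meant only to illustrate how the fitted weights shrink as lambda grows:

import numpy as np

def ridge_weights(X, y, lam):
    # closed form: w = (X^T X + lam * I)^-1 X^T y, solved as a linear system
    n_features = X.shape[1]
    A = X.T.dot(X) + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# toy data: two features plus a small amount of noise
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

for lam in (0.0, 1.0, 10.0, 100.0):
    print '%6.1f %s' % (lam, ridge_weights(X, y, lam))
# As lam increases, the fitted weights are pulled toward zero (shrinkage);
# lam = 0 reproduces the ordinary least squares solution.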
Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider a first test case: regressing the daily price variation of the Copper ETF (symbol: CU) on the stock's daily volatility and volume as features. The implementation of the extraction of observations is identical to that of the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
                               .drop(1)
                               .zip(_price.take(_price.size -1))
                               .map( z => z._1 - z._2)) //2
val data = volatility.get
                     .zip(volume.get)
                     .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size-1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4

regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data, the ETF daily prices, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value is plotted for values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases, and the curve seems to reach a minimum around λ = 1. The case λ = 0 corresponds to the least squares regression. Next, let's plot the RSS value for λ varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, overfitting gets more expensive, and therefore the RSS value increases.

The regression weights can simply be output as follows:

regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1) - price(t), is plotted as the λ = 0 case. The predicted values for λ = 0.8 are very similar to the original data and follow its pattern, with large variations (peaks and troughs) reduced. The predicted values for λ = 5 correspond to a smoothed dataset: the pattern of the original data is preserved, but the magnitude of the price variation is significantly reduced. The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and the F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Resources for Article:

Further resources on this subject:
Differences in style between Java and Scala code [Article]
Dependency Management in SBT [Article]
Introduction to MapReduce [Article]