How-To Tutorials - Data

1210 Articles

Designing User Security for Oracle PeopleSoft Applications

Packt
04 Jul 2011
12 min read
Understanding user security

Before we get into discussing PeopleSoft security, let's spend some time setting the context for user security. Whenever we think of a complex system like PeopleSoft Financials with potentially hundreds of users, the following are but a few of the questions that face us:

- Should a user working in the billing group have access to transactions, such as vouchers and payments, in Accounts Payable?
- Should a user who is part of the North America business unit have access to the data belonging to the Europe business unit?
- Should a user whose job involves entering vouchers be able to approve and pay them as well?
- Should a data entry clerk be able to view departmental budgets for the organization?

These questions barely scratch the surface of an organization's security considerations. Of course, there is no right or wrong answer to such questions, as every organization has its own security policies. What matters is that we need a mechanism to segregate access to system features and to enforce appropriate controls so that users can access only the features they need.

Implementation challenge

Global Vehicles' Accounts Payable department has three types of users: Mary, the AP Manager; Amy, the AP Supervisor; and Anton, the AP Clerk. These users need the following access to PeopleSoft features:

| User type | Access to feature | Description |
| --- | --- | --- |
| AP Clerk | Voucher entry | A clerk should be able to access system pages to enter vouchers from various vendors. |
| AP Supervisor | Voucher entry, Voucher approval, Voucher posting | A supervisor also needs access to enter vouchers. He/she should review and approve each voucher entered by the clerk. The supervisor should also be able to execute the Voucher Post process that creates accounting entries for vouchers. |
| AP Manager | Pay cycle, Voucher approval | The AP Manager should be the only one who can execute the Pay Cycle (a process that prints checks to issue payments to vendors). The manager (in addition to the supervisor) should also have the authority to approve vouchers. |

Note that this is an extremely simplified scenario that does not include all the features required in Accounts Payable.

Solution

We will design a security matrix that uses distinct security roles. We'll configure permission lists, roles, and user profiles to limit user access to the required system features. The PeopleSoft security matrix is a three-level structure consisting of permission lists (at the bottom), roles (in the middle), and user profiles (at the top). The following illustration shows how it is structured.

We need to create a user profile for each user of the system. A user profile can have as many roles as needed. For example, a user can have the roles of Billing Supervisor and Accounts Receivable Payment Processor if he/she approves customer invoices as well as processes customer payments. Thus, the number of roles granted to a user depends entirely on his/her job responsibilities. Each role can have multiple permission lists. A permission list determines which features can be accessed by a role: we specify which pages can be accessed and the mode in which they can be accessed (read only/add/update). In a nutshell, a role is a grouping of system features that a user needs to access, while a permission list defines the nature of access to those system features.
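PeopleSoft delivers this three-level structure through its online configuration pages rather than through code, but a small illustrative sketch can make the relationships concrete. The class names and fields below are hypothetical and exist only to model the hierarchy just described:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the PeopleSoft security hierarchy:
// a user profile holds roles, and each role holds permission lists.
class PermissionList {
    final String name;              // e.g. "PL_Voucher_Entry"
    PermissionList(String name) { this.name = name; }
}

class Role {
    final String name;              // e.g. "RL_Voucher_Entry"
    final List<PermissionList> permissionLists = new ArrayList<>();
    Role(String name) { this.name = name; }
}

class UserProfile {
    final String userId;            // e.g. "AMY"
    final List<Role> roles = new ArrayList<>();
    UserProfile(String userId) { this.userId = userId; }
}
```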
Expert tip

Deciding how many roles to create, and which ones, requires careful thought. How easy will it be to maintain them in the future? Think of scenarios where a function needs to be removed from one role and added to another: how easy would that be? As a rule of thumb, map system features to roles so that they do not overlap. Similarly, map system pages to permission lists so that they are mutually exclusive. This simplifies user security maintenance to a large extent. Although advisable, this may not always be possible; organizational requirements are sometimes too complicated. However, a PeopleSoft professional should try to build a modular security design to the extent possible.

Now, let's design our security matrix for the hypothetical scenario presented previously and test our rule of thumb of mutually exclusive roles and permission lists. What do you observe as far as the required system features for each role are concerned? You are absolutely right if you noticed that some system features (such as voucher entry) are common across roles. Which roles should we design for this situation? Going by the principle of mutually exclusive roles, we can map system features to permission lists (and, in turn, to roles) without overlapping them. We'll denote roles with the prefix 'RL' and permission lists with the prefix 'PL'. The mapping may look something like this:

| Role | Permission list | System feature |
| --- | --- | --- |
| RL_Voucher_Entry | PL_Voucher_Entry | Voucher Entry |
| RL_Voucher_Approval | PL_Voucher_Approval | Voucher Approval |
| RL_Voucher_Posting | PL_Voucher_Posting | Voucher Posting |
| RL_Pay_Cycle | PL_Pay_Cycle | Pay Cycle |

So, now we have created the required roles and permission lists, and attached the appropriate permission lists to each role. Because the scenario is simple, each role has only a single permission list assigned to it. As the final step, we'll assign the appropriate roles to each user's user profile:

| User | Role | System feature accessed |
| --- | --- | --- |
| Mary | RL_Voucher_Approval | Voucher Approval |
| Mary | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Entry | Voucher Entry |
| Amy | RL_Voucher_Approval | Voucher Approval |
| Amy | RL_Voucher_Posting | Voucher Posting |
| Anton | RL_Voucher_Entry | Voucher Entry |

As you can see, each user now has access to the appropriate system features through the roles, and in turn the permission lists, attached to their user profile. Can you think of the advantage of this approach? Let's say that a few months down the line, it is decided that Mary (AP Manager) should be the only one approving vouchers, while Amy (AP Supervisor) should also have the ability to execute pay cycles and issue payments to vendors. How can we accommodate this change? It's quite simple: we remove the role RL_Voucher_Approval from Amy's user profile and add the role RL_Pay_Cycle to it. The security matrix now looks like this:

| User | Role | System feature accessed |
| --- | --- | --- |
| Mary | RL_Voucher_Approval | Voucher Approval |
| Mary | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Entry | Voucher Entry |
| Amy | RL_Pay_Cycle | Pay Cycle |
| Amy | RL_Voucher_Posting | Voucher Posting |
| Anton | RL_Voucher_Entry | Voucher Entry |

Thus, security maintenance becomes less cumbersome when we design roles and permission lists with non-overlapping system features. Of course, this approach has a downside as well: it results in a large number of roles and permission lists, which increases the initial configuration effort. The solution that we actually design for an organization needs to balance these two objectives.
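Continuing the hypothetical sketch above, the maintenance change described here (Amy giving up voucher approval and taking over the pay cycle) touches only the role assignments on her user profile; the roles and permission lists themselves stay untouched:

```java
// Hypothetical usage of the classes sketched earlier in this article
public class SecurityMatrixExample {
    public static void main(String[] args) {
        UserProfile amy = new UserProfile("AMY");
        Role voucherEntry = new Role("RL_Voucher_Entry");
        Role voucherApproval = new Role("RL_Voucher_Approval");
        Role voucherPosting = new Role("RL_Voucher_Posting");
        Role payCycle = new Role("RL_Pay_Cycle");

        // Original matrix: Amy enters, approves, and posts vouchers
        amy.roles.add(voucherEntry);
        amy.roles.add(voucherApproval);
        amy.roles.add(voucherPosting);

        // Policy change: approval moves to the manager only,
        // and Amy takes over the pay cycle instead
        amy.roles.remove(voucherApproval);
        amy.roles.add(payCycle);
    }
}
```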
Expert tip

Having too many permission lists assigned to a user profile can adversely affect system performance. PeopleSoft recommends 10-20 permission lists per user.

Configuring permission lists

Follow this navigation to configure permission lists: PeopleTools | Security | Permissions & Roles | Permission Lists.

The following screenshot shows the General tab of the Permission List setup page, where we can specify the permission list description. We'll go through some of the important configuration activities for the Voucher Entry permission list discussed previously:

- Can Start Application Server?: Selecting this field enables a user with this permission list to start PeopleSoft application servers (all batch processes are executed on this server). A typical system user does not need this option.
- Allow Password to be Emailed?: Selecting this field enables users to receive forgotten passwords through e-mail. Leave the field unchecked to prevent unencrypted passwords from being sent in e-mails.
- Time-out Minutes (Never Time-out and Specific Time-out): These fields determine the number of minutes of inactivity after which the system automatically logs out a user with this permission list.

The following screenshot shows the Pages tab of the permission list page. This is the most important place where we specify the menus, components, and ultimately the pages that a user can access. In PeopleSoft parlance, a component is a collection of related pages. In the previous screenshot, you are looking at a page; the other related pages used to configure permission lists (General, PeopleTools, Process, and so on) together constitute a component. A menu is a collection of various components and is typically related to a system feature, such as 'Enter Voucher Information', as you can see in the screenshot. Thus, to grant access to a page, we need to enter its component and menu details.

Expert tip

To find out the component and menu for a page, press CTRL+J when the page is open in the internet browser.

- Menu Name: This is where we specify all the menus to which a user needs access. Note that a permission list can grant access to multiple menus, but our simple example includes only one system feature (voucher entry) and, in turn, only one menu. Click the + button to add and select more menus.
- Edit Components hyperlink: Once we select a menu, we need to specify which components under it the permission list should access. Click the Edit Components hyperlink to proceed. The system opens the Component Permissions page, as shown in the following screenshot. This page lists all components under the selected menu; the screenshot shows only part of the component list under the 'Enter Voucher Information' menu. The voucher entry pages for which we need to grant access exist under a component named VCHR_EXPRESS.

Click the Edit Pages hyperlink to grant access to specific pages under a given component. The system opens the Page Permissions page, as shown in the following screenshot, which lists part of the pages under the Voucher Entry component:

- Panel Item Name: This field shows the page name to which access is to be granted.
- Authorized?: Select this checkbox to enable a user to access the page. As you can see, we have authorized this permission list to access six pages in this component.
- Display Only: Selecting this checkbox gives the user read-only access to the page; he/she cannot make any changes to the data on it.
- Add: Selecting this checkbox enables the user to add a new transaction (in this case, new vouchers).
- Update/Display: Selecting this checkbox enables the user to retrieve the current effective dated row. He/she can also add future effective dated rows in addition to modifying them.
- Update/Display All: This option gives the user all the privileges of the Update/Display option; in addition, he/she can retrieve past effective dated rows as well.
- Correction: This option enables the user to perform all possible operations, that is, to retrieve, modify, or add past, current, and future effective dated rows.

Effective dates are usually relevant for master data setups. An effective date drives when a particular value comes into effect. A vendor may be set up with an effective date of 1/1/2001 so that it comes into effect from that date. Now assume that its name is slated to change on 1/1/2012. In that case, we can add a new vendor row with the new name and the appropriate effective date, and the system automatically starts using the new name from 1/1/2012. Note that there can be multiple future effective dated rows, but only one current row.

The next tab, PeopleTools, contains configuration options that are more relevant for technical developers. As we are concentrating on business users of the system, we'll not discuss them.

Click the Process tab. As shown in the following screenshot, this tab is used to configure options for process groups. A process group is a collection of batch processes belonging to a specific internal department or business process. For example, PeopleSoft delivers a process group ARALL that includes all Accounts Receivable batch processes. PeopleSoft delivers various pre-configured process groups; however, we can create our own process groups depending on the organization's requirements. Click the Process Group Permissions hyperlink. The system opens a new page where we can select as many process groups as needed. When a process group is selected for a permission list, it enables users to execute the batch and online processes that are part of it.

The following screenshot shows the Sign-on Times tab. This tab controls the time spans during which a user with this permission list can sign on to the system. We can enforce specific days, or specific time spans within a day, when users can sign on. In the case of our permission list, there are no such limits, and users with this permission list will be able to sign on anytime, on all days of the week.

The next tab on this page is Component Interface. A component interface is a PeopleSoft utility that automates bulk data entry into PeopleSoft pages. We can select component interface values on this tab so that users with this permission list have the rights to use them.

Due to the highly technical nature of the activities involved, we will not discuss the Web Libraries, Web Services, Mass Change, Links, and Audit tabs. Oracle offers exhaustive documentation on PeopleSoft applications.

The next important tab on the permission list page is Query. On this tab, the system shows two hyperlinks: Access Group Permissions and Query Profile. Click the Access Group Permissions hyperlink. The system opens the Permission List Access Groups page, as shown in the next screenshot. This page controls which database tables users can access to create database queries using a PeopleSoft tool called Query Manager:

- Tree Name: A tree is a hierarchical structure of database tables. As shown in the screenshot, the tree QUERY_TREE_AP groups all AP tables.
- Access Group: Each tree has multiple nodes called access groups, which are simply logical groups of tables within a tree. In the screenshot, VOUCHERS is a group of voucher-related tables within the AP table tree. With this configuration, users will be able to create database queries on voucher tables in AP.

Using the Query Profile hyperlink, we can set various options that control how users can create queries, such as whether they can use joins, unions, and so on, how many database rows a query can fetch, and whether the user can copy the query output to Microsoft Excel.

Flink Complex Event Processing

Packt
16 Jan 2017
13 min read
In this article by Tanmay Deshpande, the author of the book Mastering Apache Flink, we will learn about the Table API provided by Apache Flink and how we can use it to process relational data structures. We will start learning more about the libraries provided by Apache Flink and how we can use them for specific use cases. To start with, let's try to understand a library called complex event processing (CEP). CEP is a very interesting but complex topic that has its value in various industries. Wherever a stream of events is expected, people naturally want to perform complex event processing. Let's try to understand what CEP is all about.

What is complex event processing?

CEP is a technique to analyze streams of disparate events occurring with high frequency and low latency. These days, streaming events can be found in various industries, for example:

- In the oil and gas domain, sensor data comes from various drilling tools or from upstream oil pipeline equipment
- In the security domain, activity data, malware information, and usage pattern data come from various end points
- In the wearables domain, data comes from various wrist bands with information about your heart beat rate, your activity, and so on
- In the banking domain, data comes from credit card usage, banking activities, and so on

It is very important to analyze these patterns and get notified in real time about any change from the regular pattern. CEP is able to understand patterns across streams of events, sub-events, and their sequences. CEP helps to identify meaningful patterns and complex relationships among seemingly unrelated events, and sends notifications in real and near-real time to avoid damage:

The preceding diagram shows how the CEP flow works. Even though the flow looks simple, CEP has various abilities, such as:

- The ability to produce results as soon as the input event stream is available
- The ability to provide computations such as aggregation over time and timeouts between two events of interest
- The ability to provide real-time/near-real-time alerts and notifications on detection of complex event patterns
- The ability to connect and correlate heterogeneous sources and analyze patterns in them
- The ability to achieve high-throughput, low-latency processing

There are various solutions available in the market. With big data technology advancements, we have multiple options such as Apache Spark, Apache Samza, and Apache Beam, among others, but none of them has a dedicated library to fit all solutions. Now let us try to understand what we can achieve with Flink's CEP library.

Flink CEP

Apache Flink provides the Flink CEP library, which provides APIs to perform complex event processing. The library consists of the following core components:

- Event stream
- Pattern definition
- Pattern detection
- Alert generation

Flink CEP works on Flink's streaming API called DataStream. A programmer needs to define the pattern to be detected from the stream of events, and then Flink's CEP engine detects the pattern and takes appropriate action, such as generating alerts. In order to get started, we need to add the following Maven dependency:

```xml
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
```

Event stream

A very important component of CEP is its input event stream. We have seen details of the DataStream API.
Now let's use that knowledge to implement CEP. The very first thing we need to do is define a Java POJO for the event. Let's assume we need to monitor a temperature sensor event stream. First we define an abstract class and then extend it. While defining the event POJOs, we need to make sure that we implement the hashCode() and equals() methods, as they are used when events are compared. The following code snippets demonstrate this. First, we write an abstract class as shown here:

```java
package com.demo.chapter05;

public abstract class MonitoringEvent {

    private String machineName;

    public String getMachineName() {
        return machineName;
    }

    public void setMachineName(String machineName) {
        this.machineName = machineName;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + ((machineName == null) ? 0 : machineName.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        MonitoringEvent other = (MonitoringEvent) obj;
        if (machineName == null) {
            if (other.machineName != null)
                return false;
        } else if (!machineName.equals(other.machineName))
            return false;
        return true;
    }

    public MonitoringEvent(String machineName) {
        super();
        this.machineName = machineName;
    }
}
```

Then we write the actual temperature event:

```java
package com.demo.chapter05;

public class TemperatureEvent extends MonitoringEvent {

    public TemperatureEvent(String machineName) {
        super(machineName);
    }

    private double temperature;

    public double getTemperature() {
        return temperature;
    }

    public void setTemperature(double temperature) {
        this.temperature = temperature;
    }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = super.hashCode();
        long temp;
        temp = Double.doubleToLongBits(temperature);
        result = prime * result + (int) (temp ^ (temp >>> 32));
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!super.equals(obj))
            return false;
        if (getClass() != obj.getClass())
            return false;
        TemperatureEvent other = (TemperatureEvent) obj;
        if (Double.doubleToLongBits(temperature) != Double.doubleToLongBits(other.temperature))
            return false;
        return true;
    }

    public TemperatureEvent(String machineName, double temperature) {
        super(machineName);
        this.temperature = temperature;
    }

    @Override
    public String toString() {
        return "TemperatureEvent [getTemperature()=" + getTemperature() + ", getMachineName()="
                + getMachineName() + "]";
    }
}
```

Now we can define the event source as follows.
In Java:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<TemperatureEvent> inputEventStream = env.fromElements(
    new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1),
    new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2),
    new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3),
    new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4),
    new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0));
```

In Scala:

```scala
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val input: DataStream[TemperatureEvent] = env.fromElements(
  new TemperatureEvent("xyz", 22.0), new TemperatureEvent("xyz", 20.1),
  new TemperatureEvent("xyz", 21.1), new TemperatureEvent("xyz", 22.2),
  new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.3),
  new TemperatureEvent("xyz", 22.1), new TemperatureEvent("xyz", 22.4),
  new TemperatureEvent("xyz", 22.7), new TemperatureEvent("xyz", 27.0))
```

Pattern API

The Pattern API allows you to define complex event patterns very easily. Each pattern consists of multiple states. To go from one state to another, we generally need to define conditions. The conditions could be continuity or filtered-out events. Let's try to understand each pattern operation in detail.

Begin

The initial state can be defined as follows.

In Java:

```java
Pattern<Event, ?> start = Pattern.<Event>begin("start");
```

In Scala:

```scala
val start : Pattern[Event, _] = Pattern.begin("start")
```

Filter

We can also specify the filter condition for the initial state.

In Java:

```java
start.where(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // condition
    }
});
```

In Scala:

```scala
start.where(event => ... /* condition */)
```

Subtype

We can also filter out events based on their subtypes, using the subtype() method.

In Java:

```java
start.subtype(SubEvent.class).where(new FilterFunction<SubEvent>() {
    @Override
    public boolean filter(SubEvent value) {
        return ... // condition
    }
});
```

In Scala:

```scala
start.subtype(classOf[SubEvent]).where(subEvent => ... /* condition */)
```

Or

The Pattern API also allows us to define multiple conditions together. We can use the OR and AND operators.

In Java:

```java
pattern.where(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // condition
    }
}).or(new FilterFunction<Event>() {
    @Override
    public boolean filter(Event value) {
        return ... // or condition
    }
});
```

In Scala:

```scala
pattern.where(event => ... /* condition */).or(event => ... /* or condition */)
```

Continuity

As stated earlier, we do not always need to filter out events. There can be patterns where we need continuity instead of filters. Continuity can be of two types: strict continuity and non-strict continuity.

Strict continuity

Strict continuity requires two events to succeed each other directly, which means there should be no other event in between. This pattern is defined by next().

In Java:

```java
Pattern<Event, ?> strictNext = start.next("middle");
```

In Scala:

```scala
val strictNext: Pattern[Event, _] = start.next("middle")
```

Non-strict continuity

Non-strict continuity means that other events are allowed to occur between the two specified events. This pattern is defined by followedBy().

In Java:

```java
Pattern<Event, ?> nonStrictNext = start.followedBy("middle");
```

In Scala:

```scala
val nonStrictNext : Pattern[Event, _] = start.followedBy("middle")
```

Within

The Pattern API also allows us to do pattern matching based on time intervals. We can define a time-based temporal constraint as follows.
In Java:

```java
next.within(Time.seconds(30));
```

In Scala:

```scala
next.within(Time.seconds(10))
```

Detecting patterns

To detect patterns against the stream of events, we need to run the stream through the pattern. CEP.pattern() returns a PatternStream. The following code snippet shows how we can detect a pattern. First, the pattern is defined to check whether the temperature value reaches 26.0 degrees or more within 10 seconds.

In Java:

```java
Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first")
    .subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() {
        public boolean filter(TemperatureEvent value) {
            if (value.getTemperature() >= 26.0) {
                return true;
            }
            return false;
        }
    }).within(Time.seconds(10));

PatternStream<TemperatureEvent> patternStream = CEP.pattern(inputEventStream, warningPattern);
```

In Scala:

```scala
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

val input = // data

val pattern: Pattern[TempEvent, _] = Pattern.begin("start").where(event => event.temp >= 26.0)

val patternStream: PatternStream[TempEvent] = CEP.pattern(input, pattern)
```

Use case – complex event processing on a temperature sensor

In the earlier sections, we learnt about the various features provided by the Flink CEP engine. Now it's time to understand how we can use it in a real-world solution. For that, let's assume we work for a mechanical company which produces some products. In the product factory, there is a need to constantly monitor certain machines. The factory has already set up sensors which keep sending the temperature of the machines at a given time. Now we will be setting up a system that constantly monitors the temperature value and generates an alert if the temperature exceeds a certain value. We can use the following architecture:

Here we will be using Kafka to collect events from the sensors. In order to write a Java application, we first need to create a Maven project and add the following dependencies:

```xml
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-cep-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-cep-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-java_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala_2.10 -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-streaming-scala_2.10</artifactId>
  <version>1.1.2</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka-0.9_2.10</artifactId>
  <version>1.0.0</version>
</dependency>
```

Next, we need to do the following things to use Kafka. First, we need to define a custom Kafka deserializer. This will read bytes from a Kafka topic and convert them into a TemperatureEvent. The following is the code to do this.
EventDeserializationSchema.java:

```java
package com.demo.chapter05;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.TypeExtractor;
import org.apache.flink.streaming.util.serialization.DeserializationSchema;

public class EventDeserializationSchema implements DeserializationSchema<TemperatureEvent> {

    public TypeInformation<TemperatureEvent> getProducedType() {
        return TypeExtractor.getForClass(TemperatureEvent.class);
    }

    public TemperatureEvent deserialize(byte[] arg0) throws IOException {
        String str = new String(arg0, StandardCharsets.UTF_8);
        String[] parts = str.split("=");
        return new TemperatureEvent(parts[0], Double.parseDouble(parts[1]));
    }

    public boolean isEndOfStream(TemperatureEvent arg0) {
        return false;
    }
}
```

Next, we create a topic in Kafka called temperature:

```
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic temperature
```

Now we move to the Java code, which will listen to these events in Flink streams:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "test");

DataStream<TemperatureEvent> inputEventStream = env.addSource(
    new FlinkKafkaConsumer09<TemperatureEvent>("temperature", new EventDeserializationSchema(), properties));
```

Next, we will define the pattern to check whether the temperature reaches 26.0 degrees Celsius or more within 10 seconds:

```java
Pattern<TemperatureEvent, ?> warningPattern = Pattern.<TemperatureEvent> begin("first")
    .subtype(TemperatureEvent.class).where(new FilterFunction<TemperatureEvent>() {
        private static final long serialVersionUID = 1L;

        public boolean filter(TemperatureEvent value) {
            if (value.getTemperature() >= 26.0) {
                return true;
            }
            return false;
        }
    }).within(Time.seconds(10));
```

Next, we match this pattern with the stream of events and select the event. We also add the alert messages to a results stream, as shown here:

```java
DataStream<Alert> patternStream = CEP.pattern(inputEventStream, warningPattern)
    .select(new PatternSelectFunction<TemperatureEvent, Alert>() {
        private static final long serialVersionUID = 1L;

        public Alert select(Map<String, TemperatureEvent> event) throws Exception {
            return new Alert("Temperature Rise Detected:" + event.get("first").getTemperature()
                + " on machine name:" + event.get("first").getMachineName());
        }
    });
```

In order to see the alerts generated, we will print the results:

```java
patternStream.print();
```

And we execute the stream:

```java
env.execute("CEP on Temperature Sensor");
```
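The select function above returns Alert objects, but the Alert class itself is not shown in this excerpt. A minimal sketch, assuming a simple message-only POJO whose toString() matches the sample output shown below, could look like this:

```java
package com.demo.chapter05;

// Minimal sketch of the Alert POJO referenced by the pattern-select code;
// the original class is not included in this excerpt, so this is an assumption.
public class Alert {

    private String message;

    public Alert(String message) {
        this.message = message;
    }

    public String getMessage() {
        return message;
    }

    public void setMessage(String message) {
        this.message = message;
    }

    @Override
    public String toString() {
        return "Alert [message=" + message + "]";
    }
}
```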
Now we are all set to execute the application. As and when we get messages in the Kafka topic, the CEP will keep executing. The actual execution will look like the following.

Example input:

```
xyz=21.0
xyz=30.0
LogShaft=29.3
Boiler=23.1
Boiler=24.2
Boiler=27.0
Boiler=29.0
```

Example output:

```
Connected to JobManager at Actor[akka://flink/user/jobmanager_1#1010488393]
10/09/2016 18:15:55 Job execution switched to status RUNNING.
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to SCHEDULED
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to DEPLOYING
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to SCHEDULED
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to DEPLOYING
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to SCHEDULED
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to DEPLOYING
10/09/2016 18:15:55 Source: Custom Source(2/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(3/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(1/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(2/4) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(3/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(4/4) switched to RUNNING
10/09/2016 18:15:55 Source: Custom Source(1/4) switched to RUNNING
10/09/2016 18:15:55 CEPPatternOperator(1/1) switched to RUNNING
10/09/2016 18:15:55 Map -> Sink: Unnamed(4/4) switched to RUNNING
1> Alert [message=Temperature Rise Detected:30.0 on machine name:xyz]
2> Alert [message=Temperature Rise Detected:29.3 on machine name:LogShaft]
3> Alert [message=Temperature Rise Detected:27.0 on machine name:Boiler]
4> Alert [message=Temperature Rise Detected:29.0 on machine name:Boiler]
```

We can also configure a mail client and use an external web hook to send e-mail or messenger notifications. The code for the application can be found on GitHub: https://github.com/deshpandetanmay/mastering-flink.

Summary

We learnt about complex event processing (CEP). We discussed the challenges involved and how we can use the Flink CEP library to solve CEP problems. We also learnt about the Pattern API and the various operators we can use to define patterns. In the final section, we tried to connect the dots and walk through one complete use case. With some changes, this setup can be used as-is in various other domains as well. We will also see how to use Flink's built-in machine learning library to solve complex problems.

Resources for Article:

Further resources on this subject:
- Getting Started with Apache Spark DataFrames [article]
- Getting Started with Apache Hadoop and Apache Spark [article]
- Integrating Scala, Groovy, and Flex Development with Apache Maven [article]

SAP BusinessObjects: Customizing the Dashboard

Packt
27 May 2011
4 min read
SAP BusinessObjects Dashboards 4.0 Cookbook: over 90 simple and incredibly effective recipes for transforming your business data into exciting dashboards with SAP BusinessObjects Dashboards 4.0 Xcelsius.

Introduction

In this article, we will go through certain techniques showing how you can use the different cosmetic features Dashboard Design provides in order to improve the look of your dashboard. Dashboard Design provides a powerful way to capture the audience compared with other dashboard tools. It allows developers to build dashboards with the important 'wow' factor that other tools lack. Take, for example, two dashboards that have exactly the same functionality and placement of charts, but where one looks much more attractive than the other. In general, people looking at the nicer-looking dashboard will be more interested and thus get more value out of the data that comes out of it. So, not only does Dashboard Design provide a powerful and flexible way of presenting data, it also provides the 'wow' factor to capture a user's interest.

Changing the look of a chart

This recipe will run through changing the look of a chart. In particular, it will go through each tab under the appearance icon of the chart properties. We will then make modifications and see the resulting changes.

Getting ready

Insert a chart object onto the canvas. Prepare some data and bind it to the chart.

How to do it...

1. Double-click/right-click on the chart object on the canvas/object properties window to go into Chart Properties.
2. In the Layout tab, uncheck Show Chart Background.
3. In the Series tab, click on the colored square box circled in the next screenshot to change the color of the bar to your desired color. Then change the width of each bar: click on the Marker Size area and change it to 35.
4. Click on the colored boxes circled in red in the Axes tab and choose dark blue to modify the horizontal and vertical axes separately. Uncheck Show Minor Gridlines at the bottom so that we remove all the horizontal lines between the major gridlines.
5. Next, go to the Text and Color tabs, where you can make changes to all the different text areas of the chart.

How it works...

As you can see, the default chart looks plain and the bars are skinny, so it's harder to visualize things. It is a good idea to remove the chart background if there is an underlying background so that the chart blends in better. In addition, the changes to the chart colors and text provide additional aesthetics that help improve the look of the chart.

Adding a background to your dashboard

This recipe shows the usefulness of backgrounds in the dashboard. It will show how backgrounds can provide additional depth to objects and help group certain areas together for better visualization.

Getting ready

Make sure you have all your objects, such as charts and selectors, ready on the canvas. Here's an example of the two charts before the makeover. Bind some data to the charts if you want to change the coloring of the series.

How to do it...

1. Choose Background4 from the Art and Backgrounds tab of the Components window.
2. Stretch the background so that it fills the size of the canvas.
3. Make sure that the background is ordered before the charts. To change the ordering of the background, go to the object browser, select the background object, and then press the "-" key until the background object is behind the chart.
4. Select Background1 from the Art and Backgrounds tab and put two of them under the charts, as shown in the following screenshot.
5. When the backgrounds are in the proper place, open the properties window for the backgrounds and set the background color to your desired color. In this example, we picked turquoise blue for each background.

How it works...

As you can see from the before and after pictures, backgrounds can make a huge difference in terms of aesthetics. The objects are much more pleasant to look at now, and there is certainly a lot more depth to the charts. The best way to choose the right backgrounds for your dashboard is to play around with the different background objects and their colors. If you are not very artistic, you can come up with a set of examples and demonstrate them to the business user to see which one they prefer the most.

The EU commission introduces guidelines for achieving a ‘Trustworthy AI’

Savia Lobo
09 Apr 2019
4 min read
On the third day of the Digital Day 2019 held in Brussels, the European Commission introduced a set of essential guidelines for building trustworthy AI, which will guide companies and governments in building ethical AI applications. By introducing these new guidelines, the commission is working towards a three-step approach:

- Setting out the key requirements for trustworthy AI
- Launching a large-scale pilot phase for feedback from stakeholders
- Working on international consensus building for human-centric AI

The EU's high-level expert group on AI, which consists of 52 independent experts representing academia, industry, and civil society, came up with seven requirements that, according to them, future AI systems should meet.

Seven guidelines for achieving an ethical AI

- Human agency and oversight: AI systems should enable equitable societies by supporting human agency and fundamental rights, and not decrease, limit or misguide human autonomy.
- Robustness and safety: A trustworthy AI requires algorithms to be secure, reliable and robust enough to deal with errors or inconsistencies during all life cycle phases of AI systems.
- Privacy and data governance: Citizens should have full control over their own data, while data concerning them will not be used to harm or discriminate against them.
- Transparency: The traceability of AI systems should be ensured.
- Diversity, non-discrimination, and fairness: AI systems should consider the whole range of human abilities, skills and requirements, and ensure accessibility.
- Societal and environmental well-being: AI systems should be used to enhance positive social change and enhance sustainability and ecological responsibility.
- Accountability: Mechanisms should be put in place to ensure responsibility and accountability for AI systems and their outcomes.

According to the EU's official press release, "Following the pilot phase, in early 2020, the AI expert group will review the assessment lists for the key requirements, building on the feedback received. Building on this review, the Commission will evaluate the outcome and propose any next steps."

The plans fall under the Commission's AI strategy of April 2018, which "aims at increasing public and private investments to at least €20 billion annually over the next decade, making more data available, fostering talent and ensuring trust", the press release states.

Andrus Ansip, Vice-President for the Digital Single Market, said, "The ethical dimension of AI is not a luxury feature or an add-on. It is only with trust that our society can fully benefit from technologies. Ethical AI is a win-win proposition that can become a competitive advantage for Europe: being a leader of human-centric AI that people can trust."

Mariya Gabriel, Commissioner for Digital Economy and Society, said, "We now have a solid foundation based on EU values and following an extensive and constructive engagement from many stakeholders including businesses, academia and civil society. We will now put these requirements to practice and at the same time foster an international discussion on human-centric AI."

Thomas Metzinger, a Professor of Theoretical Philosophy at the University of Mainz, who was also a member of the commission's expert group that worked on the guidelines, has put forward an article titled 'Ethics washing made in Europe'. Metzinger said he worked on the Ethics Guidelines for nine months. "The result is a compromise of which I am not proud, but which is nevertheless the best in the world on the subject. The United States and China have nothing comparable. How does it fit together?", he writes.

Eline Chivot, a senior policy analyst at the Center for Data Innovation think tank, told The Verge, "We are skeptical of the approach being taken, the idea that by creating a golden standard for ethical AI it will confirm the EU's place in global AI development. To be a leader in ethical AI you first have to lead in AI itself."

To know more about this news in detail, read the EU press release.

- Is Google trying to ethics-wash its decisions with its new Advanced Tech External Advisory Council?
- IEEE Standards Association releases ethics guidelines for automation and intelligent systems
- Sir Tim Berners-Lee on digital ethics and socio-technical systems at ICDPPC 2018

Setting Up the iReport Pages

Packt
29 Mar 2010
2 min read
Configuring the page format

We can follow these steps to set up report pages:

1. Open the report List of Products.
2. Go to the menu Window | Report Inspector. The Report Inspector window will appear on the left side of the report designer.
3. Select the report List of Products, right-click on it, and choose Page Format….
4. The Page format… dialog box will appear. Select A4 from the Format drop-down list, and select Portrait from the Page orientation section.
5. You can modify the page margins if you need to, or leave them as they are to keep the default margins. For our report, you need not change the margins.
6. Press OK.

Page size

You have seen that there are many preset sizes/formats for the report, such as Custom, Letter, Note, Legal, A0 to A10, B0 to B5, and so on. Choose the appropriate one based on your requirements; we have chosen A4. If the number of columns is too high to fit in Portrait, choose the Landscape orientation. If you change the preset size, the report elements (title, column headings, fields, and other elements) will not be repositioned automatically according to the new page size. You have to position each element manually, so be careful if you decide to change the page size.

Configuring properties

We can modify the default settings of the report properties in the following way: right-click on List of Products and choose Properties. We can configure many important report properties from the Properties window. You can see that there are many options here: you can change the Report name, Page size, Margins, Columns, and more. We have already learnt about setting up pages, so now our concern is to learn about some of the other (More…) options.
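iReport writes these page settings into the report's JRXML file. If you ever need to apply the same settings programmatically, the JasperReports API exposes them on the JasperDesign class. The sketch below is illustrative only; the exact setter names (in particular setOrientation with OrientationEnum) assume a reasonably recent JasperReports release and are not taken from this article:

```java
// Illustrative sketch: setting the same A4/portrait page options through
// the JasperReports API instead of the iReport Page format... dialog.
import net.sf.jasperreports.engine.design.JasperDesign;
import net.sf.jasperreports.engine.type.OrientationEnum;

public class PageFormatExample {
    public static void main(String[] args) {
        JasperDesign design = new JasperDesign();
        design.setName("ListOfProducts");

        // A4 in points (1 point = 1/72 inch): 595 x 842
        design.setPageWidth(595);
        design.setPageHeight(842);
        design.setOrientation(OrientationEnum.PORTRAIT);

        // Default-style margins; adjust only if the layout requires it
        design.setLeftMargin(20);
        design.setRightMargin(20);
        design.setTopMargin(20);
        design.setBottomMargin(20);
    }
}
```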

Introduction to Couchbase

Packt
04 Nov 2015
20 min read
In this article by Henry Potsangbam, the author of the book Learning Couchbase, we will learn that Couchbase is a NoSQL, non-relational database management system, which differs from traditional relational database management systems in many significant ways. It is designed for distributed data stores with very large-scale data storage requirements (terabytes and petabytes of data). These types of data stores might not require fixed schemas, avoid join operations, and typically scale horizontally. The main feature of Couchbase is that it is schemaless: there is no fixed schema to store data, and there are no joins between data records or documents. It allows distributed storage and utilizes computing resources, such as CPU and RAM, across the nodes that are part of the Couchbase cluster. Couchbase databases provide the following benefits:

- A flexible data model. You don't need to worry about the schema; you can design your schema depending on the needs of your application domain, not on storage demands.
- Easy scalability. Since it's a distributed system, it can scale out horizontally without too many changes in the application. You can scale out with a few mouse clicks and rebalance very easily.
- High availability, since there are multiple servers and data is replicated across nodes.

The architecture of Couchbase

A Couchbase cluster consists of multiple nodes. A cluster is a collection of one or more instances of the Couchbase server that are configured as a logical cluster. The following diagram shows the Couchbase server architecture.

As mentioned earlier, while most cluster technologies work on a master-slave relationship, Couchbase works on a peer-to-peer node mechanism. This means there is no difference between the nodes in the cluster; the functionality provided by each node is the same. Thus, there is no single point of failure. When one node fails, another node takes up its responsibility, thus providing high availability.

The data manager

Any operation performed on the Couchbase database system gets stored in memory, which acts as a caching layer. By default, every document gets stored in memory for each read, insert, update, and so on, until the memory is full. It's a drop-in replacement for Memcache. However, in order to provide persistence of records, there is a concept called the disk queue, which flushes records to disk asynchronously without impacting the client request. This functionality is provided automatically by the data manager, without any human intervention.

Cluster management

The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and the data manager. It manages data storage and retrieval, and contains the memory cache layer, the disk persistence mechanism, and the query engine. Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicate with the data manager on that node to perform database operations.

Buckets

In an RDBMS, we usually encapsulate all of the relevant data for a particular application in a database. Say, for example, we are developing an e-commerce application.
We usually create a database named e-commerce that serves as the logical namespace for storing records in tables, such as customer or shopping cart details. In Couchbase terminology, this is called a bucket. So, whenever you want to store any document in a Couchbase cluster, you will create a bucket as a logical namespace as the first step. Precisely, a bucket is an independent virtual container that groups documents logically in a Couchbase cluster; it is the equivalent of a database namespace in an RDBMS. It can be accessed by various clients in an application, and you can also configure features such as security, replication, and so on per bucket.

In RDBMS development, we usually create one database and consolidate all related tables in that namespace. Likewise, in Couchbase, you will usually create one bucket per application and encapsulate all the documents in it. Now, let me explain this concept in detail, since it's the component that administrators and developers will be working with most of the time. In fact, I used to wonder why it is named "bucket"; perhaps because we can store anything in it, as we do in the physical world. In any database system, the main purpose is to store data, and the logical namespace for storing data is called a database. Likewise, in Couchbase, the namespace for storing data is called a bucket. In brief, it's a data container that stores data related to applications, either in RAM or on disk.

A bucket also helps you partition application data depending on an application's requirements. If you are hosting different types of applications in a cluster, say an e-commerce application and a data warehouse, you can partition them using buckets: create two buckets, one for the e-commerce application and another for the data warehouse. As a rule of thumb, you create one bucket for each application.

In an RDBMS, we store data in the form of rows in a table, which in turn is encapsulated by a database. In Couchbase, a bucket is the equivalent of a database, but there is no concept of tables; all data or records, referred to as documents, are stored directly in a bucket. Basically, the lowest namespace for storing documents or data in Couchbase is a bucket. Internally, Couchbase stores documents in different storage locations for different buckets. Information such as runtime statistics is collected and reported by the Couchbase cluster, grouped by bucket type. Couchbase also lets you flush individual buckets: you can create a separate temporary bucket, rather than a regular transaction bucket, when you need temporary storage for ad hoc requirements such as reporting or a temporary workspace for application programming, and flush the temporary bucket after use. The features or capabilities of a bucket depend on its type, which is discussed next.

Types of buckets

Couchbase provides two types of buckets, which are differentiated by their storage mechanism and capabilities:

- Memcached
- Couchbase

Memcached

As the name suggests, buckets of the Memcached type store documents only in RAM. This means that documents stored in a Memcached bucket are volatile in nature; such buckets won't survive a system reboot. Documents stored in such buckets are accessible by direct address using the key-value pair mechanism. The bucket is distributed, which means that it is spread across the Couchbase cluster nodes.
Since it's volatile in nature, you need to be sure of your use cases before using this type of bucket. You can use it to store data that is required only temporarily and for better performance, since all of the data is stored in memory, but that doesn't require durability. Suppose you need to display a list of countries in your application: instead of always fetching from disk storage, the better way is to fetch the data from disk once, populate it in the Memcached bucket, and use it in your application. In a Memcached bucket, the maximum allowed size of a document is 1 MB. All of the data is stored in RAM, and if the bucket runs out of memory, the oldest data is discarded. We can't replicate a Memcached bucket. It's completely compatible with the open source Memcached distributed memory object caching system. If you want to know more about the Memcached technology, you can refer to http://memcached.org/.

Couchbase

The Couchbase bucket type gives persistence to documents. It is distributed across a cluster of nodes, and you can configure replication for it, which is not supported by the Memcached bucket type. It's highly available, since documents are replicated across nodes in a cluster. You can verify the bucket using the web Admin UI as follows:

Understanding documents

By now, you must have understood the concept of buckets, how they work, how they are configured, and so on. Let's now understand the items that get stored in them. So, what is a document? A document is a piece of information or data that gets stored in a bucket. It's the smallest item that can be stored in a bucket. As a developer, you will always be working on a bucket in terms of documents. Documents are similar to rows in an RDBMS table schema, but in NoSQL terminology they are referred to as documents. It's a way of thinking about and designing data objects: all information and data gets stored as a document, just as it would be represented in a physical document. NoSQL databases, including Couchbase, don't require a fixed schema to store documents in a particular bucket. In Couchbase, documents are represented in the form of JSON.

For the time being, let's try to understand the document at a basic level. Let me show you how a document is represented in Couchbase for better clarity. You need to install the beer-sample bucket for this, which comes along with the Couchbase software installation. If you did not install it earlier, you can do it from the web console using the Settings button.

The document overview

The preceding screenshot shows a document; it represents a brewery, and its document ID is 21st_amendment_brewery_cafe. Each document can have multiple properties/items along with their values. For example, name is a property and 21st Amendment Brewery Café is the value of the name property. So, what is this document ID? The document ID is a unique identifier assigned to each document in a bucket. You need to assign a unique ID whenever a document gets stored in a bucket; it's just like the primary key of a table in an RDBMS.

Keys and metadata

As described earlier, a document key is a unique identifier for a document, and its value can be any string. In addition to the key, documents usually have three more types of metadata, which are provided by the Couchbase server unless modified by an application developer. They are as follows:

- rev: This is an internal revision ID meant for internal use by Couchbase. It should not be used in the application.
- expiration: If you want your document to expire after a certain amount of time, you can set that value here. By default, it is 0, that is, the document never expires.
- flags: These are numerical values specific to the client library that are updated when the document is stored.

Document modeling

Being schemaless is a good feature for bringing agility to applications whose business processes change frequently, as demanded by their business environment. With this methodology, you don't need to be concerned about the structure of the data initially while designing the application. This means that, as a developer, you don't need to worry about the structure of a database schema, such as tables, or about splitting information into various tables; instead, you should focus on the application requirements and on satisfying business needs.

I still recollect various moments of designing domain objects and tables that I've been through as a developer, especially when I had just graduated from engineering college and was developing applications for a corporate company. Whenever I was part of a discussion about application requirements, I had some of these questions at the back of my mind:

- How does a domain object get stored in the database?
- What will the table structures be?
- How will I retrieve the domain objects?
- Will it be difficult to use an ORM such as Hibernate, EJB, and so on?

My point here is that, instead of being mentally present in the discussion on requirement gathering and understanding the business requirements in detail, I spent more time mapping business entities into a table format. The reason was that if I did not put forward the technical constraints at that time, it would be difficult to come back later with the technical challenges we could face in the data structure design. Earlier, whenever we talked about application design, we always thought about database design structures, such as converting objects into multiple tables using normalization forms (2NF/3NF), and spent a lot of time mapping database objects to application objects using various ORM tools, such as Hibernate, EJB, and so on.

In document modeling, we always think in terms of application requirements, that is, the data or information flow, while designing documents, not in terms of storage. We can simply start our application development using the business representation of an entity without much concern about the storage structures. Having covered the various advantages provided by a document-based system, we will now discuss how to design such documents for storage in any document-based database system, such as Couchbase. Then, we can effectively design domain objects that are coherent with the application requirements.

Whenever we model a document's structure, we need to consider two main options: one is to store all information in one document, and the other is to break it down into multiple documents. You need to consider both and choose one, keeping the application requirements in mind. An important factor is to evaluate whether the information contains unrelated data components that are independent and can be broken up into different documents, or whether all components represent a complete domain object that will be accessed together most of the time. If the data components of a piece of information are related and will usually be required together in business logic, consider grouping them in a single logical container so that the application developer won't perceive them as separate objects or documents.
All of these factors depend on the nature of the application being developed and its use cases. Besides these, you need to think in terms of accessing information, such as atomicity, single unit of access, and so on. You can ask yourself a question such as, "Are we going to create or modify the information as a single unit or not?". We also need to consider concurrency, what will happen when the document is accessed by multiple clients at the same time and so on. After looking at all these considerations that you need to keep in mind while designing a document, you have two options: one is to keep all of the information in a single document, and the other is to have a separate document for every different object type. Couchbase SDK overview We have also discussed some of the guidelines used for designing document-based database system. What if we need to connect and perform operations on the Couchbase cluster in an application? This can be achieved using Couchbase client libraries, which are also collectively known as the Couchbase Software Development Kit (SDK). The Couchbase SDK APIs are language dependent. However, the concept remains the same and is applicable to all languages that are supported by the SDK. Let's now try to understand the Couchbase APIs as a concept without referring to any specific language, and then we will map these concepts to Java APIs in the Java SDK section. Couchbase SDK clients are also known as smart clients since they understand the overall status of the cluster, that is, clustermap, and keep the information of the vBucket and its server nodes updated. There are two types of Couchbase clients, as follows: Smart clients: Such clients can understand the health of the cluster and receive constant updates about the information of the cluster. Each smart client maintains a clustermap that can derive the cluster node where a document is stored using the document ID, for example, Java, .NET, and so on. Memcached-compatible: Such clients are used for applications that would be interacting with the traditional memcached bucket, which is not aware of vBucket. It needs to install Moxi (a memcached proxy) on all clients that require access to the Couchbase memcache bucket, which act as a proxy to convert the API's call to the memcache compatible call. Understanding the write operation in the Couchbase cluster Let's understand how the write operation works in the Couchbase cluster. When a write command is issued using the set operation on the Couchbase cluster, the server immediately responds once the document is written to the memory of that particular node. How do clients know which nodes in the cluster will be responsible for storing the document? You might recall that every operation requires a document ID, using this document ID, the hash algorithm determines the vBucket in which it belongs. Then, this vBucket is used to determine the node that will store the document. All mapping information, vBucket to node, is stored in each of the Couchbase client SDKs, which form the clustermap. Views Whenever we want to extract fields from JSON documents without document ID, we use views. If you want to find a document or fetch information about a document with attributes or fields of a document other than the document ID, a view is the way to go for it. Views are written in the form of MapReduce, which we have discussed earlier, that is, it consists of map and reduce phase. Couchbase implements MapReduce using the JavaScript language. 
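To give a first taste of what that JavaScript looks like before walking through the view engine, here is a minimal sketch of a map function over the beer-sample brewery documents. The chapter's own view appears only as a screenshot in the original, so this stand-in is an illustrative assumption rather than the author's exact code:

function (doc, meta) {
  // Index only brewery documents that actually carry a name
  if (doc.type == "brewery" && doc.name) {
    // Emit the name as the key and null as the value; the full document
    // can always be fetched later by its ID, available here as meta.id
    emit(doc.name, null);
  }
}

The sections that follow describe how such a function is processed by the view engine and what its two parameters mean.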
The following diagram shows you how various documents are passed through the View engine to produce an index. The View engine ensures that all documents in the bucket are passed through the map method for processing and subsequently to reduce function to create indexes.   When we write views, the View Engine defines materialized views for JSON documents and then queries across the dataset in the bucket. Couchbase provides a view processor to process the entire documents with map and reduce methods defined by the developer to create views. The views are maintained locally by each node for the documents stored in that particular node. Views are created for documents that are stored on the disk only. A view's life cycle A view has its own life cycle. You need to define, build, and query it, as shown in this diagram:   View life cycle  Initially, you will define the logic of MapReduce and build it on each node for each document that is stored locally. In the build phase, we usually emit those attributes that need to be part of indexes. Views usually work on JSON documents only. If documents are not in the JSON format or the attributes that we emit in the map function are not part of the document, then the document is ignored during the generation of views by the view engine. Finally, views are queried by clients to retrieve and find documents. After the completion of this cycle, you can still change the definition of MapReduce. For that, you need to bring the view to development mode and modify it. Thus, you have the view cycle as shown in the preceding diagram while developing a view.   The preceding code shows a view. A view has predefined syntax. You can't change the method signature. Here, it follows the functional programming syntax. The preceding code shows a map method that accepts two parameters: doc: This represents the entire document meta: This represents the metadata of the document Each map will return some objects in the form of key and value. This is represented by the emit() method. The emit() method returns key and value. However, value will usually be null. Since, we can retrieve a document using the key, it's better to use that instead of using the value field of the emit() method. Custom reduce functions Why do we need custom reduce functions? Sometimes, the built-in reduce function doesn’t meet our requirements, although it will suffice most of the time. Custom reduce functions allow you to create your own reduce function. In such a reduce function, the output of map function goes to the corresponding reduce function group as per the key of the map output and the group level parameter. Couchbase ensures that output from the map will be grouped by key and supplied to reduce. Then it’s the developer's role to define logic in reduce, what to perform on the data such as aggregating, addition, and so on. To handle the incremental MapReduce functionality (that is, updating an existing view), each function must also be able to handle and consume its own output. In an incremental situation, the function must handle both new records and previously computed reductions. The input to the reduce function can be not only raw data from the map phase, but also the output of a previous reduce phase. This is called re-reduce and can be identified by the third argument of reduce(). When the re-reduce argument is false, both the key and value arguments are arrays, the value argument array matches the corresponding element with that of array of key. 
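As a hedged sketch of the signature just described (again, not the book's screenshot code), a custom reduce that counts documents per key, including the re-reduce case, could be written as:

function (keys, values, rereduce) {
  if (rereduce) {
    // Re-reduce: values holds partial counts produced by earlier reduce
    // passes, so the function consumes its own output by summing them
    var total = 0;
    for (var i = 0; i < values.length; i++) {
      total = total + values[i];
    }
    return total;
  }
  // First pass: keys and values are parallel arrays of raw map output,
  // so the number of entries is the count for this group
  return values.length;
}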
For example, key[1] is the key that corresponds to value[1]. The execution flow from map to reduce is illustrated in the original by the figure "Map reduce execution in a view", which is not reproduced here.

N1QL overview

So far, you have learned how to fetch documents in two ways: using the document ID and using views. The third way of retrieving documents is N1QL, pronounced "nickel". Personally, I feel that providing an SQL-like syntax was a great move by Couchbase, since most engineers and IT professionals are already familiar with SQL, which is usually part of their formal education. It gives them confidence and makes Couchbase easier to adopt in their applications. Moreover, it covers most development-related database operations. N1QL can be used to:

Store documents, that is, the INSERT command
Fetch documents, that is, the SELECT command

Prior to the advent of N1QL, developers had to perform key-based operations, which became quite complex when information had to be retrieved using views and custom reduce functions. With the previously available options, developers needed to know the key before performing any operation on a document, which is not always the case. Before N1QL was incorporated into Couchbase, you could not run ad hoc queries on documents in a bucket until you had created views on it. Moreover, we sometimes need to perform joins or searches within a bucket, which is not possible using the document ID and views alone. All of these drawbacks are addressed by N1QL; I would go so far as to call N1QL an evolutionary step in Couchbase's history.

Understanding the N1QL syntax

Most N1QL queries have the following format:

SELECT [DISTINCT] <expression> FROM <data source> WHERE <expression> GROUP BY <expression> ORDER BY <expression> LIMIT <number> OFFSET <number>

The preceding statement is very generic; it shows the main options N1QL provides in a single piece of syntax:

SELECT * FROM LearningCouchbase

This selects every document stored in the bucket LearningCouchbase. The output of the query (not shown here) is in the JSON document format; all documents returned by an N1QL query appear as array values under the resultset attribute.

Summary

You learned how to design a document-based data schema and how to connect to Couchbase from a Java-based application using connection pooling. You also learned how to retrieve data using MapReduce-based views, how to use N1QL's SQL-like syntax to extract documents from a Couchbase bucket, and how to provide high availability with XDCR. It will also enable you to perform a full-text search by integrating the Elasticsearch plugin. Resources for Article: Further resources on this subject: MAPREDUCE FUNCTIONS [article] PUTTING YOUR DATABASE AT THE HEART OF AZURE SOLUTIONS [article] MOVING SPATIAL DATA FROM ONE FORMAT TO ANOTHER [article]

SQL Server Integration Services (SSIS)

Packt
03 Sep 2013
5 min read
(For more resources related to this topic, see here.) SSIS as an ETL – extract, transform, and load tool The primary objective of an ETL tool is to be able to import and export data to and from heterogeneous data sources. This includes the ability to connect to external systems, as well as to transform or clean the data while moving the data between the external systems and the databases. SSIS can be used to import data to and from SQL Server. It can even be used to move data between external non-SQL systems without requiring SQL server to be the source or the destination. For instance, SSIS can be used to move data from an FTP server to a local flat file. SSIS also provides a workflow engine for automation of the different tasks (for example, data flows, tasks executions, and so on.) that are executed in an ETL job. An SSIS package execution can itself be one step that is part of an SQL Agent job, and SQL Agent can run multiple jobs independent of each other. An SSIS solution consists of one or more package, each containing a control flow to perform a sequence of tasks. Tasks in a control flow can include calls to web services, FTP operations, file system tasks, automation of command line commands, and others. In particular, a control flow usually includes one or more data flow tasks, which encapsulate an in-memory, buffer-based pipeline of data from a source to a destination, with transformations applied to the data as it flows through the pipeline. An SSIS package has one control flow, and as many data flows as necessary. Data flow execution is dictated by the content of the control flow. A detailed discussion on SSIS and its components are outside the scope of this article and it assumes that you are familiar with the basic SSIS package development using Business Intelligence Development Studio (SQL Server 2005/2008/2008 R2) or SQL Server Data Tools (SQL Server 2012). If you are a beginner in SSIS, it is highly recommended to read from a bunch of good SSIS books available as a prerequisite. In the rest of this article, we will focus on how to consume Hive data from SSIS using the Hive ODBC driver. The prerequisites to develop the package shown in this article are SQL Server Data Tools, (which comes as a part of SQL Server 2012 Client Tools and Components) and the 32-bit Hive ODBC Driver installed. You will also need your Hadoop cluster up with Hive running on it. Developing the package SQL Server Data Tools (SSDT) is the integrated development environment available from Microsoft to design, deploy, and develop SSIS packages. SSDT is installed when you choose to install SQL Server Client tools and Workstation Components from your SQL Server installation media. SSDT supports creation of Integration Services, Analysis Services, and Reporting Services projects. Here, we will focus on Integration Services project type. Creating the project Launch SQL Server Data Tools from SQL Server 2012 Program folders as shown in the following screenshot: Create a new Project and choose Integration Services Project in the New Project dialog as shown in the following screenshot: This should create the SSIS project with a blank Package.dtsx inside it visible in the Solution Explorer window of the project as shown in the following screenshot: Creating the Data Flow A Data Flow is a SSIS package component, which consists of the sources and destinations that extract and load data, the transformations that modify and extend data, and the paths that link sources, transformations, and destinations. 
Before you can add a data flow to a package, the package control flow must include a Data Flow task. The Data Flow task is the executable within the SSIS package, which creates, orders, and runs the data flow. A separate instance of the data flow engine is opened for each Data Flow task in a package. To create a Data Flow task, perform the following steps: Double-click (or drag-and-drop) on a Data Flow Task from the toolbox in the left. This should place a Data Flow Task in the Control Flow canvas of the package as in the following screenshot: Double-click on the Data Flow Task or click on the Data Flow tab in SSDT to edit the task and design the source and destination components as in the following screenshot: Creating the source Hive connection The first thing we need to do is create a connection manager that will connect to our Hive data tables hosted in the Hadoop cluster. We will use an ADO.NET connection, which will use the DSN HadoopOnLinux we created earlier to connect to Hive. To create the connection, perform the following steps: Right-click on the Connection Managers section in the project and click on New ADO.Net Connection... as shown in the following screenshot: From the list of providers, navigate to .Net Providers | ODBC Data Provider and click on OK in the Connection Manager window as shown in the following screenshot: Select the HadoopOnLinux DSN from the Data Sources list. Provide the Hadoop cluster credentials and test connection should succeed as shown in the following screenshot: Summary In this way we learned how to create an SQL Server Integration Services package to move data from Hadoop to SQL Server using the Hive ODBC driver. Resources for Article: Further resources on this subject: Microsoft SQL Azure Tools [Article] Connecting to Microsoft SQL Server Compact 3.5 with Visual Studio [Article] Getting Started with SQL Developer: Part 1 [Article]


How to Mine Popular Trends on GitHub using Python - Part 2

Amey Varangaonkar
27 Dec 2017
1 min read
Note: This article is an excerpt taken from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. In this book, you will find widely used social media mining techniques for extracting useful insights to drive your business.

In Part 1 of this series, we gathered the GitHub data for analysis. Here, we will analyze that data as per our requirements, to get interesting insights on the highest trending and most popular tools and languages on GitHub. We have seen so far that the GitHub API provides interesting sets of information about code repositories and metadata around the activity of its users on these repositories. In the following sections, we will analyze this data to find the most popular repositories through the analysis of their descriptions, and then drill down into the watchers, forks, and issues submitted for the emerging technologies. Since technology is evolving so rapidly, this approach can help us stay on top of the latest trends. In order to find out what the trending technologies are, we will perform the analysis in a few steps:

Identifying top technologies

First of all, we will use text analytics techniques to identify the most popular technology-related phrases in repositories from 2017. Our analysis will be focused on the most frequent bigrams. We import the nltk.collocations module, which implements n-gram search tools:

import nltk
from nltk.collocations import *

Then, we convert the clean description column into a list of tokens:

list_documents = df['clean'].apply(lambda x: x.split()).tolist()

As we perform the analysis on documents, we will use the method from_documents instead of the default from_words. The difference between these two methods lies in the input data format. The one used in our case takes a list of tokens as its argument and searches for n-grams document-wise instead of corpus-wise. It protects against detecting bigrams composed of the last word of one document and the first word of another:

bigram_measures = nltk.collocations.BigramAssocMeasures()
bigram_finder = BigramCollocationFinder.from_documents(list_documents)

We take into account only bigrams which appear at least three times in our document set:

bigram_finder.apply_freq_filter(3)

We can use different association measures to find the best bigrams, such as raw frequency, pmi, student t, or chi sq. We will mostly be interested in the raw frequency measure, which is the simplest and most convenient indicator in our case. We get the top 20 bigrams according to the raw_freq measure:

bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 20)

We can also obtain their scores by applying the score_ngrams method:

scores = bigram_finder.score_ngrams(bigram_measures.raw_freq)

All the other measures are implemented as methods of BigramCollocationFinder. To try them, you can replace raw_freq by, respectively, pmi, student_t, and chi_sq. However, to create a visualization we will need the actual number of occurrences instead of scores. We create a list by using the ngram_fd.items() method and sort it in descending order:

ngram = list(bigram_finder.ngram_fd.items())
ngram.sort(key=lambda item: item[-1], reverse=True)

This returns a list of tuples, each containing an embedded bigram tuple and its frequency.
We transform it into a simple list of tuples where we join bigram tokens: frequency = [(" ".join(k), v) for k,v in ngram] For simplicity reasons we put the frequency list into a dataframe: df=pd.DataFrame(frequency) And then, we plot the top 20 technologies in a bar chart: import matplotlib.pyplot as plt plt.style.use('ggplot') df.set_index([0], inplace = True) df.sort_values(by = [1], ascending = False).head(20).plot(kind = 'barh') plt.title('Trending Technologies') plt.ylabel('Technology') plt.xlabel('Popularity') plt.legend().set_visible(False) plt.axvline(x=14, color='b', label='Average', linestyle='--', linewidth=3) for custom in [0, 10, 14]: plt.text(14.2, custom, "Neural Networks", fontsize = 12, va = 'center', bbox = dict(boxstyle='square', fc='white', ec='none')) plt.show() We've added an additional line which helps us to aggregate all technologies related to neural networks. It is done manually by selecting elements by indices, (0,10,14) in this case. This operation might be useful for interpretation. The preceding analysis provides us with an interesting set of the most popular technologies on GitHub. It includes topics for software engineering, programming languages, and artificial intelligence. An important thing to be noted is that technology around neural networks emerges more than once, notably, deep learning, TensorFlow, and other specific projects. This is not surprising, since neural networks, which are an important component in the field of artificial intelligence, have been spoken about and practiced heavily in the last few years. So, if you're an aspiring programmer interested in AI and machine learning, this is a field to dive into! Programming languages  The next step in our analysis is the comparison of popularity between different programming languages. It will be based on samples of the top 1,000 most popular repositories by year. Firstly, we get the data for last three years: queries = ["created:>2017-01-01", "created:2015-01-01..2015-12-31", "created:2016-01-01..2016-12-31"] We reuse the search_repo_paging function to collect the data from the GitHub API and we concatenate the results to a new dataframe. df = pd.DataFrame() for query in queries: data = search_repo_paging(query) data = pd.io.json.json_normalize(data) df = pd.concat([df, data]) We convert the dataframe to a time series based on the create_at column df['created_at'] = df['created_at'].apply(pd.to_datetime) df = df.set_index(['created_at']) Then, we use aggregation method groupby which restructures the data by language and year, and we count the number of occurrences by language: dx = pd.DataFrame(df.groupby(['language', df.index.year])['language'].count()) We represent the results on a bar chart: fig, ax = plt.subplots() dx.unstack().plot(kind='bar', title = 'Programming Languages per Year', ax= ax) ax.legend(['2015', '2016', '2017'], title = 'Year') plt.show() The preceding graph shows a multitude of programming languages from assembly, C, C#, Java, web, and mobile languages, to modern ones like Python, Ruby, and Scala. Comparing over the three years, we see some interesting trends. We notice HTML, which is the bedrock of all web development, has remained very stable over the last three years. This is not something that will not be replaced in a hurry. Once very popular, Ruby now has a decrease in popularity. The popularity of Python, also our language of choice for this book, is going up. 
Finally, the cross-device programming language Swift, initially created by Apple but now open source, is getting extremely popular over time. It will be interesting to see in the next few years whether these trends change or hold true.

Programming languages used in top technologies

Now we know the top programming languages and the technologies quoted in repository descriptions. In this section, we will combine this information and find out which are the main programming languages for each technology. We select four technologies from the previous section and print the corresponding programming languages. We look up the column containing the cleaned repository description and create a set of the languages related to each technology; using a set ensures that the values are unique.

technologies_list = ['software engineering', 'deep learning', 'open source', 'exercise practice']
for tech in technologies_list:
    print(tech)
    print(set(df[df['clean'].str.contains(tech)]['language']))

software engineering
{'HTML', 'Java'}
deep learning
{'Jupyter Notebook', None, 'Python'}
open source
{None, 'PHP', 'Java', 'TypeScript', 'Go', 'JavaScript', 'Ruby', 'C++'}
exercise practice
{'CSS', 'JavaScript', 'HTML'}

By analyzing the descriptions of the top technologies and then extracting the programming languages used for them, we get a quick picture of which language communities drive each topic. There is a lot more analysis you can do with this GitHub data. Want to know how? You can check out our book Python Social Media Analytics to get a detailed walkthrough of these topics.


Structural Equation Modeling and Confirmatory Factor Analysis

Packt
06 Feb 2015
30 min read
In this article by Paul Gerrard and Radia M. Johnson, the authors of Mastering Scientific Computation with R, we'll discuss the fundamental ideas underlying structural equation modeling, which are often overlooked in other books discussing structural equation modeling (SEM) in R, and then delve into how SEM is done in R. We will then discuss two R packages, OpenMx and lavaan. We can directly apply our discussion of the linear algebra underlying SEM using OpenMx. Because of this, we will go over OpenMx first. We will then discuss lavaan, which is probably more user friendly because it sweeps the matrices and linear algebra representations under the rug so that they are invisible unless the user really goes looking for them. Both packages continue to be developed and there will always be some features better supported in one of these packages than in the other. (For more resources related to this topic, see here.) SEM model fitting and estimation methods To ultimately find a good solution, software has to use trial and error to come up with an implied covariance matrix that matches the observed covariance matrix as well as possible. The question is what does "as well as possible" mean? The answer to this is that the software must try to minimize some particular criterion, usually some sort of discrepancy function. Just what that criterion is depends on the estimation method used. The most commonly used estimation methods in SEM include: Ordinary least squares (OLS) also called unweighted least squares Generalized least squares (GLS) Maximum likelihood (ML) There are a number of other estimation methods as well, some of which can be done in R, but here we will stick with describing the most common ones. In general, OLS is the simplest and computationally cheapest estimation method. GLS is computationally more demanding, and ML is computationally more intensive. We will see why this is, as we discuss the details of these estimation methods. Any SEM estimation method seeks to estimate model parameters that recreate the observed covariance matrix as well as possible. To evaluate how closely an implied covariance matrix matches an observed covariance matrix, we need a discrepancy function. If we assume multivariate normality of the observed variables, the following function can be used to assess discrepancy: In the preceding figure, R is the observed covariance matrix, C is the implied covariance matrix, and V is a weight matrix. The tr function refers to the trace function, which sums the elements of the main diagonal. The choice of V varies based on the SEM estimation method: For OLS, V = I For GLS, V = R-1 In the case of an ML estimation, we seek to minimize one of a number of similar criteria to describe ML, as follows: In the preceding figure, n is the number of variables. There are a couple of points worth noting here. GLS estimation inverts the observed correlation matrix, something computationally demanding with large matrices, but something that must only be done once. Alternatively, ML requires inversion of the implied covariance matrix, which changes with each iteration. Thus, each iteration requires the computationally demanding step of matrix inversion. With modern fast computers, this difference may not be noticeable, but with large SEM models, this might start to be quite time-consuming. Assessing SEM model fit The final question in an SEM model is how well the model explains the data. This is answered with the use of SEM measures of fit. 
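Before turning to those measures, it helps to write out the discrepancy functions described above, which appear only as figures in the original. In standard notation (and up to a constant scaling factor that differs between textbooks), the least squares family and the ML criterion are usually given as:

F_{LS} = \frac{1}{2}\,\mathrm{tr}\!\left[\left((R - C)\,V\right)^{2}\right], \qquad V = I \ \text{(OLS)}, \quad V = R^{-1} \ \text{(GLS)}

F_{ML} = \ln\lvert C\rvert - \ln\lvert R\rvert + \mathrm{tr}\!\left(R\,C^{-1}\right) - n

Here R is the observed covariance matrix, C is the implied covariance matrix, V is the weight matrix, and n is the number of observed variables, matching the definitions used above. These are the standard textbook forms rather than a reproduction of the original figures, so treat them as a reference sketch.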
Most of these measures are based on a chi-squared distribution. The fit criteria for GLS and ML (as well as a number of other estimation procedures such as asymptotic distribution-free methods) multiplied by N-1 is approximately chi-square distributed. Here, the capital N represents the number of observations in the dataset, as opposed to lower case n, which gives the number of variables. We compute degrees of freedom as the difference between the number of estimated parameters and the number of known covariances (that is, the total number of values in one triangle of an observed covariance matrix). This gives way to the first test statistic for SEM models, a chi-squared significance level comparing our chi-square value to some minimum chi-square threshold to achieve statistical significance. As with conventional chi-square testing, a chi-square value that is higher than some minimal threshold will reject the null hypothesis. Most experimental science features such as rejection supports the hypothesis of the experiment. This is not the case in SEM, where the null hypothesis is that the model fits the data. Thus, a non-significant chi-square is an indicator of model fit, whereas a significant chi-square rejects model fit. A notable limitation of this is that a greater sample size, greater N, will increase the chi-square value and will therefore increase the power to reject model fit. Thus, using conventional chi-squared testing will tend to support models developed in small samples and reject models developed in large samples. The choice an interpretation of fit measures is a contentious one in SEM literature. However, as can be seen, chi-square has limitations. As such, other model fit criteria were developed that do not penalize models that fit in large samples (some may penalize models fit to small samples though). There are over a dozen indices, but the most common fit indices and interpretation information are as follows: Comparative fit index: In this index, a higher value is better. Conventionally, a value of greater than 0.9 was considered an indicator of good model fit, but some might argue that a value of at least 0.95 is needed. This is relatively sample size insensitive. Root mean square error of approximation: A value of under 0.08 (smaller is better) is often considered necessary to achieve model fit. However, this fit measure is quite sample size sensitive, penalizing small sample studies. Tucker-Lewis index (Non-normed fit index): This is interpreted in a similar manner as the comparative fit index. Also, this is not very sample size sensitive. Standardized root mean square residual: In this index, a lower value is better. A value of 0.06 or less is considered needed for model fit. Also, this may penalize small samples. In the next section, we will show you how to actually fit SEM models in R and how to evaluate fit using fit measures. Using OpenMx and matrix specification of an SEM We went through the basic principles of SEM and discussed the basic computational approach by which this can be achieved. SEM remains an active area of research (with an entire journal devoted to it, Structural Equation Modeling), so there are many additional peculiarities, but rather than delving into all of them, we will start by delving into actually fitting an SEM model in R. 
OpenMx is not in the CRAN repository, but it is easily obtainable from the OpenMx website, by typing the following in R: source('http://openmx.psyc.virginia.edu/getOpenMx.R')" Summarizing the OpenMx approach In this example, we will use OpenMx by specifying matrices as mentioned earlier. To fit an OpenMx model, we need to first specify the model and then tell the software to attempt to fit the model. Model specification involves four components: Specifying the model matrices; this has two parts: Declare starting values for the estimation Declaring which values can be estimated and which are fixed Telling OpenMx the algebraic relationship of the matrices that should produce an implied covariance matrix Giving an instruction for the model fitting criterion Providing a source of data The R commands that correspond to each of these steps are: mxMatrix mxAlgebra mxMLObjective mxData We will then pass the objects created with each of these commands to create an SEM model using mxModel. Explaining an entire example First, to make things simple, we will store the FALSE and TRUE logical values in single letter variables, which will be convenient when we have matrices full of TRUE and FALSE values as follows: F <- FALSE T <- TRUE Specifying the model matrices Specifying matrices is done with the mxMatrix function, which returns an MxMatrix object. (Note that the object starts with a capital "M" while the function starts with a lowercase "m.") Specifying an MxMatrix is much like specifying a regular R matrix, but MxMatrices has some additional components. The most notable difference is that there are actually two different matrices used to create an MxMatrix. The first is a matrix of starting values, and the second is a matrix that tells which starting values are free to be estimated and which are not. If a starting value is not freely estimable, then it is a fixed constant. Since the actual starting values that we choose do not really matter too much in this case, we will just pick one as a starting value for all parameters that we would like to be estimated. 
Let's take a look at the following example: mx.A <- mxMatrix( type = "Full", nrow=14, ncol=14, #Provide the Starting Values values = c(    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0 ), #Tell R which values are free to be estimated    free = c(    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, T, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, F, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, F, F,    F, F, F, F, F, F, F, F, F, F, F, T, T, F ), byrow=TRUE,   #Provide a matrix name that will be used in model fitting name="A", ) We will now apply this same technique to the S matrix. Here, we will create two S matrices, S1 and S2. They differ simply in the starting values that they supply. We will later try to fit an SEM model using one matrix, and then the other to address problems with the first one. The difference is that S1 uses starting variances of 1 in the diagonal, and S2 uses starting variances of 5. Here, we will use the "symm" matrix type, which is a symmetric matrix. We could use the "full" matrix type, but by using "symm", we are saved from typing all of the symmetric values in the upper half of the matrix. 
Let's take a look at the following matrix: mx.S1 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    1,    0, 1,    0, 0, 1,    0, 1, 0, 1,    1, 0, 0, 0, 1,    0, 1, 0, 0, 0, 1,    0, 0, 1, 0, 0, 0, 1,    0, 0, 0, 1, 0, 1, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ),      free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" )   #The alternative, S2 matrix: mx.S2 <- mxMatrix("Symm", nrow=14, ncol=14, values = c(    5,    0, 5,    0, 0, 5,    0, 1, 0, 5,    1, 0, 0, 0, 5,    0, 1, 0, 0, 0, 5,    0, 0, 1, 0, 0, 0, 5,    0, 0, 0, 1, 0, 1, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5,    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5 ),         free = c(    T,    F, T,    F, F, T,    F, T, F, T,    T, F, F, F, T,    F, T, F, F, F, T,    F, F, T, F, F, F, T,    F, F, F, T, F, T, F, T,    F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, T,    F, F, F, F, F, F, F, F, F, F, F, F, F, T ), byrow=TRUE, name="S" ) mx.Filter <- mxMatrix("Full", nrow=11, ncol=14, values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,      0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0    ),    free=FALSE,    name="Filter",    byrow = TRUE ) And finally, we will create our identity and filter matrices the same way, as follows: mx.I <- mxMatrix("Full", nrow=14, ncol=14,    values= c(        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1    ),    free=FALSE,    byrow = TRUE,    name="I" ) Fitting the model Now, it is time to declare the model that we would like to fit using the mxModel command. This part includes steps 2 through step 4 mentioned earlier. Here, we will tell mxModel which matrices to use. 
We will then use the mxAlgegra command to tell R how the matrices should be combined to reproduce the implied covariance matrix. We will tell R to use ML estimation with the mxMLObjective command, and we will tell it to apply the estimation to a particular matrix algebra, which we named "C". This is simply the right-hand side of the McArdle McDonald equation. Finally, we will tell R where to get the data to use in model fitting using the following code: factorModel.1 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S1, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),      mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) ) Now, let's tell R to fit the model and summarize the results using mxRun, as follows: summary(mxRun(factorModel.1)) Running Political Democracy Model Error in summary(mxRun(factorModel.1)) : error in evaluating the argument 'object' in selecting a method for function 'summary': Error: The job for model 'Political Democracy Model' exited abnormally with the error message: Expected covariance matrix is non-positive-definite. Uh oh! We got an error message telling us that the expected covariance matrix is not positive definite. Our observed covariance matrix is positive definite but the implied covariance matrix (at least at first) is not. This is an effect of the fact that if we multiply our starting value matrices together as specified by the McArdle McDonald equation, we get a starting implied covariance matrix. If we perform an eigenvalue decomposition of this starting implied covariance matrix, then we will find that the last eigenvalue is negative. This means a negative variance does not make much sense, and this is what "not positive definite" refers to. The good news is that this is simply our starting values, so we can fix this if we modify our starting values. In this case, we can choose values of five along the diagonal of the S matrix, and get a positive definite starting implied covariance matrix. We can rerun this using the mx.S2 matrix specified earlier and the software will proceed as follows: #Rerun with a positive definite matrix   factorModel.2 <- mxModel("Political Democracy Model", #Model Matrices mx.A, mx.S2, mx.Filter, mx.I, #Model Fitting Instructions mxAlgebra(Filter %*% solve(I-A) %*% S %*% t(solve(I - A)) %*% t(Filter), name="C"),    mxMLObjective("C", dimnames = names(PoliticalDemocracy)),    #Data to fit mxData(cov(PoliticalDemocracy), type="cov", numObs=75) )   summary(mxRun(factorModel.2)) This should provide a solution. As can be seen from the previous code, the parameters solved in the model are returned as matrix components. Just like we had to figure out how to go from paths to matrices, we now have to figure out how to go from matrices to paths (the reverse problem). In the following screenshot, we show just the first few free parameters: The preceding screenshot tells us that the parameter estimated in the position of the tenth row and twelfth column in the matrix A is 2.18. This corresponds to a path from the twelfth variable in the A matrix ind60, to the 10th variable in the matrix x2. Thus, the path coefficient from ind60 to x2 is 2.18. There are a few other pieces of information here. The first one tells us that the model has not converged but is "Mx status Green." 
This means that the model was still converging when it stopped running (that is, it did not converge), but an optimal solution was still found and therefore, the results are likely reliable. Model fit information is also provided suggesting a pretty good model fit with CFI of 0.99 and RMSEA of 0.032. This was a fair amount of work, and creating model matrices by hand from path diagrams can be quite tedious. For this reason, SEM fitting programs have generally adopted the ability to fit SEM by declaring paths rather than model matrices. OpenMx has the ability to allow declaration by paths, but applying model matrices has a few advantages. Principally, we get under the hood of SEM fitting. If we step back, we can see that OpenMx actually did very little for us that is specific to SEM. We told OpenMx how we wanted matrices multiplied together and which parameters of the matrix were free to be estimated. Instead of using the RAM specification, we could have passed the matrices of the LISREL or Bentler-Weeks models with the corresponding algebra methods to recreate an implied covariance matrix. This means that if we are trying to come up with our matrix specification, reproduce prior research, or apply a new SEM matrix specification method published in the literature, OpenMx gives us the power to do it. Also, for educators wishing to teach the underlying mathematical ideas of SEM, OpenMx is a very powerful tool. Fitting SEM models using lavaan If we were to describe OpenMx as the SEM equivalent of having a well-stocked pantry and full kitchen to create whatever you want, and you have the time and know how to do it, we might regard lavaan as a large freezer full of prepackaged microwavable dinners. It does not allow quite as much flexibility as OpenMx because it sweeps much of the work that we did by hand in OpenMx under the rug. Lavaan does use an internal matrix representation, but the user never has to see it. It is this sweeping under the rug that makes lavaan generally much easier to use. It is worth adding that the list of prepackaged features that are built into lavaan with minimal additional programming challenge many commercial SEM packages. The lavaan syntax The key to describing lavaan models is the model syntax, as follows: X =~ Y: Y is a manifestation of the latent variable X Y ~ X: Y is regressed on X Y ~~ X: The covariance between Y and X can be estimated Y ~ 1: This estimates the intercept for Y (implicitly requires mean structure) Y | a*t1 + b*t2: Y has two thresholds that is a and b Y ~ a * X: Y is regressed on X with coefficient a Y ~ start(a) * X: Y is regressed on X; the starting value used for estimation is a It may not be evident at first, but this model description language actually makes lavaan quite powerful. Wherever you have seen a or b in the previous examples, a variable or constant can be used in their place. The beauty of this is that multiple parameters can be constrained to be equal simply by assigning a single parameter name to them. Using lavaan, we can fit a factor analysis model to our physical functioning dataset with only a few lines of code: phys.func.data <- read.csv('phys_func.csv')[-1] names(phys.func.data) <- LETTERS[1:20] R has a built-in vector named LETTERS, which contains all of the capital letters of the English alphabet. The lower case vector letters contains the lowercase alphabet. We will then describe our model using the lavaan syntax. Here, we have a model of three latent variables, our factors, and each of them has manifest variables. 
Let's take a look at the following example: model.definition.1 <- ' #Factors    Cognitive =~ A + Q + R + S    Legs =~ B + C + D + H + I + J + M + N    Arms =~ E + F+ G + K +L + O + P + T    #Correlations Between Factors    Cognitive ~~ Legs    Cognitive ~~ Arms    Legs ~~ Arms ' We then tell lavaan to fit the model as follows: fit.phys.func <- cfa(model.definition.1, data=phys.func.data, ordered= c('A','B', 'C','D', 'E','F','G', 'H','I','J', 'K', 'L','M','N','O','P','Q','R', 'S', 'T')) In the previous code, we add an ordered = argument, which tells lavaan that some variables are ordinal in nature. In response, lavaan estimates polychoric correlations for these variables. Polychoric correlations assume that we binned a continuous variable into discrete categories, and attempts to explicitly model correlations assuming that there is some continuous underlying variable. Part of this requires finding thresholds (placed on an arbitrary scale) between each categorical response. (for example, threshold 1 falls between the response of 1 and 2, and so on). By telling lavaan to treat some variables as categorical, lavaan will also know to use a special estimation method. Lavaan will use diagonally weighted least squares, which does not assume normality and uses the diagonals of the polychoric correlation matrix for weights in the discrepancy function. With five response options, it is questionable as to whether polychoric correlations are truly needed. Some analysts might argue that with many response options, the data can be treated as continuous, but here we use this method to show off lavaan's capabilities. All SEM models in lavaan use the lavaan command. Here, we use the cfa command, which is one of a number of wrapper functions for the lavaan command. Others include sem and growth. These commands differ in the default options passed to the lavaan command. (For full details, see the package documentation.) Summarizing the data, we can see the loadings of each item on the factor as well as the factor intercorrelations. We can also see the thresholds between each category from the polychoric correlations as follows: summary(fit.phys.func) We can also assess things such as model fit using the fitMeasures command, which has most of the popularly used fit measures and even a few obscure ones. Here, we tell lavaan to simply extract three measures of model fit as follows: fitMeasures(fit.phys.func, c('rmsea', 'cfi', 'srmr')) Collectively, these measures suggest adequate model fit. It is worth noting here that the interpretation of fit measures largely comes from studies using maximum likelihood estimation, and there is some debate as to how well these generalize other fitting methods. The lavaan package also has the capability to use other estimators that treat the data as truly continuous in nature. For this, a particular dataset is far from multivariate normal distributed, so an estimator such as ML is appropriate to use. However, if we wanted to do so, the syntax would be as follows: fit.phys.func.ML <- cfa(model.definition.1, data=phys.func.data, estimator = 'ML') Comparing OpenMx to lavaan It can be seen that lavaan has a much simpler syntax that allows to rapidly model basic SEM models. However, we were a bit unfair to OpenMx because we used a path model specification for lavaan and a matrix specification for OpenMx. The truth is that OpenMx is still probably a bit wordier than lavaan, but let's apply a path model specification in each to do a fair head-to-head comparison. 
We will use the famous Holzinger-Swineford 1939 dataset here from the lavaan package to do our modeling, as follows: hs.dat <- HolzingerSwineford1939 We will create a new dataset with a shorter name so that we don't have to keep typing HozlingerSwineford1939. Explaining an example in lavaan We will learn to fit the Holzinger-Swineford model in this section. We will start by specifying the SEM model using the lavaan model syntax: hs.model.lavaan <- ' visual =~ x1 + x2 + x3 textual =~ x4 + x5 + x6 speed   =~ x7 + x8 + x9   visual ~~ textual visual ~~ speed textual ~~ speed '   fit.hs.lavaan <- cfa(hs.model.lavaan, data=hs.dat, std.lv = TRUE) summary(fit.hs.lavaan) Here, we add the std.lv argument to the fit function, which fixes the variance of the latent variables to 1. We do this instead of constraining the first factor loading on each variable to 1. Only the model coefficients are included for ease of viewing in this book. The result is shown in the following model: > summary(fit.hs.lavaan) …                      Estimate Std.err Z-value P(>|z|) Latent variables: visual =~    x1               0.900   0.081   11.127   0.000    x2               0.498   0.077   6.429   0.000    x3              0.656   0.074   8.817   0.000 textual =~    x4               0.990   0.057   17.474   0.000    x5               1.102   0.063   17.576   0.000    x6               0.917   0.054   17.082   0.000 speed =~    x7               0.619   0.070   8.903   0.000    x8               0.731   0.066   11.090   0.000    x9               0.670   0.065   10.305   0.000   Covariances: visual ~~    textual           0.459   0.064   7.189   0.000    speed             0.471   0.073   6.461   0.000 textual ~~    speed             0.283   0.069   4.117   0.000 Let's compare these results with a model fit in OpenMx using the same dataset and SEM model. Explaining an example in OpenMx The OpenMx syntax for path specification is substantially longer and more explicit. Let's take a look at the following model: hs.model.open.mx <- mxModel("Holzinger Swineford", type="RAM",      manifestVars = names(hs.dat)[7:15], latentVars = c('visual', 'textual', 'speed'),    # Create paths from latent to observed variables mxPath(        from = 'visual',        to = c('x1', 'x2', 'x3'),    free = c(TRUE, TRUE, TRUE),    values = 1          ), mxPath(        from = 'textual',        to = c('x4', 'x5', 'x6'),        free = c(TRUE, TRUE, TRUE),        values = 1      ), mxPath(    from = 'speed',    to = c('x7', 'x8', 'x9'),    free = c(TRUE, TRUE, TRUE),    values = 1      ), # Create covariances among latent variables mxPath(    from = 'visual',    to = 'textual',    arrows=2,    free=TRUE      ), mxPath(        from = 'visual',        to = 'speed',        arrows=2,        free=TRUE      ), mxPath(        from = 'textual',        to = 'speed',        arrows=2,        free=TRUE      ), #Create residual variance terms for the latent variables mxPath(    from= c('visual', 'textual', 'speed'),    arrows=2, #Here we are fixing the latent variances to 1 #These two lines are like st.lv = TRUE in lavaan    free=c(FALSE,FALSE,FALSE),    values=1 ), #Create residual variance terms mxPath( from= c('x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9'),    arrows=2, ),    mxData(        observed=cov(hs.dat[,c(7:15)]),        type="cov",        numObs=301    ) )     fit.hs.open.mx <- mxRun(hs.model.open.mx) summary(fit.hs.open.mx) Here are the results of the OpenMx model fit, which look very similar to lavaan's. This gives a long output. 
For ease of viewing, only the most relevant parts of the output are included in the following model (the last column that R prints giving the standard error of estimates is also not shown here): > summary(fit.hs.open.mx) …   free parameters:                            name matrix     row     col Estimate Std.Error 1   Holzinger Swineford.A[1,10]     A     x1 visual 0.9011177 2   Holzinger Swineford.A[2,10]     A     x2 visual 0.4987688 3   Holzinger Swineford.A[3,10]     A     x3 visual 0.6572487 4   Holzinger Swineford.A[4,11]     A     x4 textual 0.9913408 5   Holzinger Swineford.A[5,11]     A     x5 textual 1.1034381 6   Holzinger Swineford.A[6,11]     A     x6 textual 0.9181265 7   Holzinger Swineford.A[7,12]     A     x7   speed 0.6205055 8   Holzinger Swineford.A[8,12]     A     x8 speed 0.7321655 9   Holzinger Swineford.A[9,12]     A     x9   speed 0.6710954 10   Holzinger Swineford.S[1,1]     S     x1     x1 0.5508846 11   Holzinger Swineford.S[2,2]     S     x2     x2 1.1376195 12   Holzinger Swineford.S[3,3]     S    x3     x3 0.8471385 13   Holzinger Swineford.S[4,4]     S     x4     x4 0.3724102 14   Holzinger Swineford.S[5,5]     S     x5     x5 0.4477426 15   Holzinger Swineford.S[6,6]     S     x6     x6 0.3573899 16   Holzinger Swineford.S[7,7]      S     x7     x7 0.8020562 17   Holzinger Swineford.S[8,8]     S     x8     x8 0.4893230 18   Holzinger Swineford.S[9,9]     S     x9     x9 0.5680182 19 Holzinger Swineford.S[10,11]     S visual textual 0.4585093 20 Holzinger Swineford.S[10,12]     S visual   speed 0.4705348 21 Holzinger Swineford.S[11,12]     S textual   speed 0.2829848 In summary, the results agree quite closely. For example, looking at the coefficient for the path going from the latent variable visual to the observed variable x1, lavaan gives an estimate of 0.900 while OpenMx computes a value of 0.901. Summary The lavaan package is user friendly, pretty powerful, and constantly adding new features. Alternatively, OpenMx has a steeper learning curve but tremendous flexibility in what it can do. Thus, lavaan is a bit like a large freezer full of prepackaged microwavable dinners, whereas OpenMx is like a well-stocked pantry with no prepared foods but a full kitchen that will let you prepare it if you have the time and the know-how. To run a quick analysis, it is tough to beat the simplicity of lavaan, especially given its wide range of capabilities. For large complex models, OpenMx may be a better choice. The methods covered here are useful to analyze statistical relationships when one has all of the data from events that have already occurred. Resources for Article: Further resources on this subject: Creating your first heat map in R [article] Going Viral [article] Introduction to S4 Classes [article]


9 recommended blockchain online courses

Guest Contributor
27 Sep 2018
7 min read
Blockchain is reshaping the world as we know it. And we are not talking metaphorically because the new technology is really influencing everything from online security and data management to governance and smart contracting. Statistical reports support these claims. According to the study, the blockchain universe grows by over 40% annually, while almost 70% of banks are already experimenting with this technology. IT experts at the Editing AussieWritings.com Services claim that the potential in this field is almost limitless: “Blockchain offers a myriad of practical possibilities, so you definitely want to get acquainted with it more thoroughly.” Developers who are curious about blockchain can turn it into a lucrative career opportunity since it gives them the chance to master the art of cryptography, hierarchical distribution, growth metrics, transparent management, and many more. There were 5,743 mostly full-time job openings calling for blockchain skills in the last 12 months - representing the 320% increase - while the biggest freelancing website Upwork reported more than 6,000% year-over-year growth. In this post, we will recommend our 9 best blockchain online courses. Let’s take a look! Udemy Udemy offers users one of the most comprehensive blockchain learning sources. The target audience is people who have heard a little bit about the latest developments in this field, but want to understand more. This online course can help you to fully understand how the blockchain works, as well as get to grips with all that surrounds it. Udemy breaks down the course into several less complicated units, allowing you to figure out this complex system rather easily. It costs $19.99, but you can probably get it with a 40% discount. The one downside, however, is that content quality in terms of subject scope can vary depending on the instructor, but user reviews are a good way to gauge quality. Each tutorial lasts approximately 30 minutes, but it also depends on your own tempo and style of work. Pluralsight Pluralsight is an excellent beginner-level blockchain course. It comes in three versions: Blockchain Fundamentals, Surveying Blockchain Technologies for Enterprise, and Introduction to Bitcoin and Decentralized Technology. Course duration varies from 80 to 200 minutes depending on the package. The price of Pluralsight is $29 a month or $299 a year. Choosing one of these options, you are granted access to the entire library of documents, including course discussions, learning paths, channels, skill assessments, and other similar tools. Packt Publishing Packt Publishing has a wide portfolio of learning products on Blockchain for varying levels of experience in the field from beginners to experts. And what’s even more interesting is that you can choose your learning format from books, ebooks to videos, courses and live courses. Or you could simply subscribe to MAPT, their library to gain access to all products at a reasonable price of $29 monthly and $150 annually.  It offers several books and videos on the leading blockchain technology. You can purchase 5 blockchain titles at a discounted rate of $50. Here’s the list of top blockchain courses offered by Packt Publishing: Exploring Blockchain and Crypto-currencies: You will gain the foundational understanding of blockchain and crypto-currencies through various use-cases. Building Blockchain Projects: In this, you will be able to develop real-time practical DApps with Ethereum and JavaScript. 
Mastering Blockchain - Second Edition: You will learn about cryptography and cryptocurrencies so that you can build highly secure, decentralized applications and conduct trusted in-app transactions.
Hands-On Blockchain with Hyperledger: This book will help you leverage the power of Hyperledger Fabric to develop blockchain-based distributed ledgers with ease.
Learning Blockchain Application Development [video]: This interactive video will help you learn to build smart contracts and DApps on Ethereum.
Create Ethereum and Blockchain Applications using Solidity [video]: This video will help you learn about Ethereum, Solidity, DAO, ICO, Bitcoin, Altcoin, website security, Ripple, Litecoin, smart contracts, and apps.

Cryptozombies
Cryptozombies is an online blockchain course built around gamification. The tool teaches you to write smart contracts in Solidity by building your own crypto-collectibles game. It is entirely Ethereum-focused, but you don't need any previous experience to understand how Solidity works. A step-by-step guide explains even the smallest details, so you can quickly learn to create your own fully functional blockchain-based game. The best thing about Cryptozombies is that you can test it for free and give up in case you don't like it.

Coursera
The blockchain is the epicenter of the cryptocurrency world, so it's necessary to study it if you want to deal with Bitcoin and other digital currencies. Coursera is the leading online resource in the field of virtual currencies, so you might want to check it out. After a course like Blockchain Specialization, you'll know everything you need to separate fact from fiction when reading claims about Bitcoin and other cryptocurrencies. You'll have the conceptual foundations you need to engineer secure software that interacts with the Bitcoin network, and you'll be able to integrate ideas from Bitcoin into your own projects. It is a 4-part course spanning 4 weeks, but you can take each part separately. The price depends on the level and features you choose.

LinkedIn Learning (formerly known as Lynda)
LinkedIn Learning doesn't offer a specific blockchain course, but it does have a wide range of industry-related learning sources. A search for 'blockchain' will present you with almost 100 relevant video courses. You can find all sorts of lessons here, from beginner to expert levels, and you can filter the selection by video duration, author, software, subject, and so on. You can access the library for $15 a month.

B9Lab
The B9Lab ETH-25 Certified Online Ethereum Developer Course is another blockchain course aimed at the Ethereum platform. It's a 12-week, in-depth learning solution that targets experienced programmers. B9Lab introduces everything there is to know about blockchain and how to build useful applications. Participants are taught about the Ethereum platform, the programming language Solidity, how to use web3 and the Truffle framework, and how to tie everything together. The price is €1450, or about $1700.

IBM
IBM offers a self-paced blockchain course titled Blockchain Essentials that lasts over two hours. The video lectures and lab in this course help you learn about blockchain for business and explore key use cases that demonstrate how the technology adds value. You can learn how to leverage blockchain benefits, transform your business with the new technology, and transfer assets.
Besides that, you get a nice wrap-up and a quiz to test your knowledge upon completion. IBM's course is free of charge.

Khan Academy
Khan Academy is the last, but certainly not the least important, online course on our list. It gives users a comprehensive overview of blockchain-powered systems, particularly Bitcoin. Using this platform, you can learn more about cryptocurrency transactions, security, proof of work, and more. As an online education platform, Khan Academy won't cost you a dime.

Blockchain is a groundbreaking technology that opens new boundaries in almost every field of business. It directly influences financial markets, data management, digital security, and a variety of other industries. In this post, we presented the 9 best blockchain online courses you should try. These sources can teach you everything there is to know about blockchain basics. Take some time to check them out and you won't regret it!

Author Bio: Olivia is a passionate blogger who writes on topics of digital marketing, career, and self-development. She constantly tries to learn something new and to share this experience on various websites. Connect with her on Facebook and Twitter.

Google introduces Machine Learning courses for AI beginners
Microsoft starts AI School to teach Machine Learning and Artificial Intelligence.

Oracle E-Business Suite: Creating Bank Accounts and Cash Forecasts

Packt
19 Aug 2011
3 min read
Oracle E-Business Suite 12 Financials Cookbook: take the hard work out of your daily interactions with E-Business Suite Financials by using the 50+ recipes from this cookbook.

Introduction
The liquidity of an organization is managed in Oracle Cash Management; this includes the reconciliation of the cashbook to the bank statements, and forecasting future cash requirements. In this article, we will look at how to create bank accounts and cash forecasts. Cash Management integrates with Payables, Receivables, Payroll, Treasury, and General Ledger. Let's start by looking at the cash management process:
1. The bank generates statements.
2. The statements are sent to the organization electronically or by post.
3. The Treasury Administrator loads and verifies the bank statement into Cash Management. The statements can also be manually entered into Cash Management.
4. The loaded statements are reconciled to the cashbook transactions.
5. The results are reviewed, and amended if required.
6. The Treasury Administrator creates the journals for the transactions in the General Ledger.

Creating bank accounts
Oracle Cash Management provides us with the functionality to create bank accounts. In this recipe, we will create a bank account for a bank called Shepherds Bank, for one of its branches, the Kings Cross branch.

Getting ready
Log in to Oracle E-Business Suite R12 with the username and password assigned to you by the system administrator. If you are working on the Vision demonstration database, you can use OPERATIONS/WELCOME as the USERNAME/PASSWORD. We also need to create a bank before we can create the bank account. Let's look at how to create the bank and the branch:
1. Select the Cash Management responsibility.
2. Navigate to Setup | Banks | Banks.
3. In the Banks tab, click on the Create button.
4. Select the Create new bank option.
5. In the Country field, enter United States.
6. In the Bank Name field, enter Shepherds Bank.
7. In the Bank Number field, enter JN316.
8. Click on the Finish button.

Let's create the branch and the address:
1. Click the Create Branch icon. The Country and the Bank Name are automatically entered.
2. Click on the Continue button.
3. In the Branch Name field, enter Kings Cross.
4. Select ABA as the Branch Type.
5. Click on the Save and Next button to create the branch address.
6. In the Branch Address form, click on the Create button.
7. In the Country field, enter United States.
8. In the Address Line 1 field, enter 4234 Red Eagle Road.
9. In the City field, enter Sacred Heart.
10. In the County field, enter Renville.
11. In the State field, enter MN.
12. In the Postal Code field, enter 56285.
13. Ensure that the Status field is Active.
14. Click on the Apply button.
15. Click on the Finish button.


Visualizations Using CCC

Packt
20 May 2016
28 min read
In this article by Miguel Gaspar, the author of the book Learning Pentaho CTools, you will learn about the CCC charts library in detail. CCC is not really a Pentaho plugin, but a chart library that Webdetails created some years ago and that Pentaho started to use for the Analyzer visualizations. It allows a great level of customization by changing the properties that are applied to the charts, and it integrates perfectly with CDF, CDE, and CDA.

(For more resources related to this topic, see here.)

The dashboards that Webdetails creates make use of the CCC charts, usually with a great level of customization. Customizing them is a way to make them fancy and really good-looking and, even more importantly, a way to create a visualization that best fits the customer's or end user's needs. We really should be focused on having the best visualizations for the end user, and CCC is one of the best ways to achieve this; but to do this you need to have a very deep knowledge of the library and know how to get amazing results. I think I could write an entire book just about CCC, and in this article I will only be able to cover a small part of what I like, but I will try to focus on the basics and give you some tips and tricks that could make a difference. I'll be happy if I can give you some directions that you can follow, and then you can keep searching and learning about CCC.

An important part of CCC is understanding some properties, such as the series in rows or the crosstab mode, because that is where people usually struggle at the start. When you can't find a property to change some styling, functionality, or behavior of the charts, you might find a way to extend the options by using something called extension points, so we will also cover them. I also find the interaction within the dashboard to be an important feature, so we will look at how to use it, and you will see that it's very simple.

In this article, you will learn how to:
Understand the properties needed to adapt the chart to your data source results
Use the properties of a CCC chart
Create a CCC chart by using the JavaScript library
Make use of internationalization of CCC charts
See how to handle clicks on charts
Scale the base axis
Customize the tooltips

Some background on CCC
CCC is built on top of Protovis, a JavaScript library that allows you to produce visualizations based on simple marks such as bars, dots, and lines, among others, which are created through dynamic properties based on the data to be represented. You can get more information on this at: http://mbostock.github.io/protovis/. If you want to extend the charts with elements that are not available, you can, but it is useful to have an idea of how Protovis works. CCC has a great website, available at http://www.webdetails.pt/ctools/ccc/, where you can see some samples including the source code. On the page, you can edit the code, change some properties, and click the apply button. If the code is valid, you will see your chart update. As well as that, it provides documentation for almost all of the properties and options that CCC makes available.

Making use of the CCC library in a CDF dashboard
As CCC is a chart library, you can use it as you would on any other webpage, just like the samples on the CCC website. But CDF also provides components that you can use to place a CCC chart on a dashboard and fully integrate it with the life cycle of the dashboard.
To use a CCC chart on a CDF dashboard, the HTML that is invoked from the XCDF file would look like the following (as we have already covered how to build a CDF dashboard, I will not focus on that, and will mainly focus on the JavaScript code):

<div class="row">
  <div class="col-xs-12">
    <div id="chart"></div>
  </div>
</div>
<script language="javascript" type="text/javascript">
  require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccBarChartComponent'],
    function(Dashboard, CccBarChartComponent) {
      var dashboard = new Dashboard();
      var chart = new CccBarChartComponent({
          type: "cccBarChart",
          name: "cccChart",
          executeAtStart: true,
          htmlObject: "chart",
          chartDefinition: {
              height: 200,
              path: "/public/…/queries.cda",
              dataAccessId: "totalSalesQuery",
              crosstabMode: true,
              seriesInRows: false,
              timeSeries: false,
              plotFrameVisible: false,
              compatVersion: 2
          }
      });
      dashboard.addComponent(chart);
      dashboard.init();
  });
</script>

The most important thing here is the CCC chart component being used; in this example it's a bar chart. We can see that from the object we are instantiating, CccBarChartComponent, and also from the type, which is cccBarChart. The previous dashboard will execute the query specified as the dataAccessId of the CDA file set in the path property, and render the chart on the dashboard. We are also saying that its data comes from the query in crosstab mode, but that the base axis should not be a time series. There are series in the columns, but don't worry about this, as we'll be covering it later.

The existing CCC components that you are able to use out of the box inside CDF dashboards are listed below. Don't forget that CCC has plenty of charts, so the samples linked below are just one example of the type of charts you can achieve.

CccAreaChartComponent: cccAreaChart
CccBarChartComponent: cccBarChart - http://www.webdetails.pt/ctools/ccc/#type=bar
CccBoxplotChartComponent: cccBoxplotChart - http://www.webdetails.pt/ctools/ccc/#type=boxplot
CccBulletChartComponent: cccBulletChart - http://www.webdetails.pt/ctools/ccc/#type=bullet
CccDotChartComponent: cccDotChart - http://www.webdetails.pt/ctools/ccc/#type=dot
CccHeatGridChartComponent: cccHeatGridChart - http://www.webdetails.pt/ctools/ccc/#type=heatgrid
CccLineChartComponent: cccLineChart - http://www.webdetails.pt/ctools/ccc/#type=line
CccMetricDotChartComponent: cccMetricDotChart - http://www.webdetails.pt/ctools/ccc/#type=metricdot
CccMetricLineChartComponent: cccMetricLineChart
CccNormalizedBarChartComponent: cccNormalizedBarChart
CccParCoordChartComponent: cccParCoordChart
CccPieChartComponent: cccPieChart - http://www.webdetails.pt/ctools/ccc/#type=pie
CccStackedAreaChartComponent: cccStackedAreaChart - http://www.webdetails.pt/ctools/ccc/#type=stackedarea
CccStackedDotChartComponent: cccStackedDotChart
CccStackedLineChartComponent: cccStackedLineChart - http://www.webdetails.pt/ctools/ccc/#type=stackedline
CccSunburstChartComponent: cccSunburstChart - http://www.webdetails.pt/ctools/ccc/#type=sunburst
CccTreemapAreaChartComponent: cccTreemapAreaChart - http://www.webdetails.pt/ctools/ccc/#type=treemap
CccWaterfallAreaChartComponent: cccWaterfallAreaChart - http://www.webdetails.pt/ctools/ccc/#type=waterfall

In the sample code, you will also find a property called compatVersion with a value of 2 set.
This will make CCC work as a revamped version that delivers more options and a lot of improvements, and is easier to use.

Mandatory and desirable properties
Besides properties such as name, datasource, and htmlObject, there are other chart properties that are mandatory. The height is really important, because if you don't set the height of the chart, it will not fit in the dashboard. The height should be specified in pixels. If you don't set the width of the component, the chart will grab the width of the element where it's being rendered; to be more precise, it will grab the width of the HTML element whose name is specified in the htmlObject property. The seriesInRows, crosstabMode, and timeSeries properties are optional, but depending on the kind of chart you are generating, you might want to specify them. The use of these properties becomes clear once we can also see the output of the queries we are executing. We need to get deeper into the properties that are related to mapping data to visual elements.

Mapping data
We need to be aware of the way that data mapping is done in the chart. You can understand how it works if you imagine the data input as a table. CCC can receive the data in two different structures: relational and crosstab. If CCC receives data as a crosstab, it will translate it to a relational structure. You can see this in the following examples.

Crosstab
The following table is an example of the crosstab data structure:

             Column Data 1     Column Data 2
Row Data 1   Measure Data 1.1  Measure Data 1.2
Row Data 2   Measure Data 2.1  Measure Data 2.2

Creating crosstab queries
To create a crosstab query, you can usually do this with grouping when using SQL, or just use MDX, which allows us to easily specify a set for the columns and one for the rows.

Just by looking at the previous and following examples, you should be able to understand that in the crosstab structure (the previous), columns and rows are part of the result set, while in the relational format (the following), column headers are not part of the result set, but are part of the metadata that is returned from the query. The relational format is as follows:

Column         Row         Value
Column Data 1  Row Data 1  Measure Data 1.1
Column Data 2  Row Data 1  Measure Data 1.2
Column Data 1  Row Data 2  Measure Data 2.1
Column Data 2  Row Data 2  Measure Data 2.2

The preceding two data structures represent the options when setting the crosstabMode and seriesInRows properties.

The crosstabMode property
To better understand these concepts, we will make use of a real example. The crosstabMode property is easy to understand when comparing two tables that represent the results of two queries.

Non-crosstab (relational):

Markets  Sales
APAC     1281705
EMEA     50028224
Japan    503957
NA       3852061

Crosstab:

Markets  2003   2004   2005
APAC     3529   5938   3411
EMEA     16711  23630  9237
Japan    2851   1692   380
NA       13348  18157  6447

In the first table, you can find the sales values for each of the territories. The values presented depend on only one variable: the territory. We can say that we are able to get all the information just by looking at the rows, where we can see a direct connection between markets and the sales value.
In the second table, you will find a value for each territory/year combination, meaning that the values presented depend on two variables: the territory in the rows and the year in the columns. Here we need both the rows and the columns to know what each of the values represents. Relevant information can be found in both the rows and the columns, so this is a crosstab. Crosstabs display the joint distribution of two or more variables, and are usually represented in the form of a contingency table in a matrix. When the result of a query depends on only one variable, you should set the crosstabMode property to false. When it depends on two or more variables, you should set the crosstabMode property to true; otherwise CCC will just use the first two columns, as in the non-crosstab example.

The seriesInRows property
Now let's use the same example where we have a crosstab. The previous image shows two charts: the one on the left is a crosstab with the series in the rows, and the one on the right is also a crosstab but the series are not in the rows (the series are in the columns). When crosstabMode is set to true, the measure column title can be translated as a series or a category, and that's determined by the seriesInRows property. If this property is set to true, the chart will read the series from the rows; otherwise it will read the series from the columns. If crosstabMode is set to false, the chart component expects a row to correspond to exactly one data point, and two or three columns can be returned. When three columns are returned, they can be category, series, and data or series, category, and data, and that's determined by the seriesInRows property: when set to true, CCC expects the structure to have the three columns as category, series, and data; when set to false, it expects them to be series, category, and data.

A simple table should give you a quicker reference, so here goes:

crosstabMode  seriesInRows  Description
true          true          The column titles act as category values, while the series values come from the first column.
true          false         The column titles act as series values, while the category values come from the first column.
false         true          The columns are expected in the order category, series, and data.
false         false         The columns are expected in the order series, category, and data.

The timeSeries and timeSeriesFormat properties
The timeSeries property defines whether the data to be represented by the chart is discrete or continuous. If we want to present some values over time, the timeSeries property should be set to true. When we set the chart to be a time series, we also need to set another property to tell CCC how it should interpret the dates that come from the query. Check out the following image for timeSeries and timeSeriesFormat. The result of one of the queries has the year and the abbreviated month name separated by -, like 2015-Nov. For the chart to understand it as a date, we need to specify the format by setting the timeSeriesFormat property, which in our example would be %Y-%b, where %Y is the year represented by four digits and %b is the abbreviated month name. The format should be specified using the Protovis format, which follows the same format as strftime in the C programming language, aside from some unsupported options. To find out which options are available, take a look at the documentation at: https://mbostock.github.io/protovis/jsdoc/symbols/pv.Format.date.html.
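As a quick illustration, here is a minimal sketch of how those data-mapping options could be wired into a CDF component for that month-name example, mirroring the earlier CccBarChartComponent sample. The component name, HTML object, and query name are placeholders of my own (not values from the original example), it assumes the line chart component follows the same module path pattern as the bar chart, and it assumes a non-crosstab result set with one row per month:

require(['cdf/Dashboard.Bootstrap', 'cdf/components/CccLineChartComponent'],
  function(Dashboard, CccLineChartComponent) {
    var dashboard = new Dashboard();
    // Placeholder names: salesOverTime, chartOverTime, salesByMonthQuery.
    dashboard.addComponent(new CccLineChartComponent({
        type: "cccLineChart",
        name: "salesOverTime",
        executeAtStart: true,
        htmlObject: "chartOverTime",
        chartDefinition: {
            height: 200,
            path: "/public/…/queries.cda",      // placeholder CDA file
            dataAccessId: "salesByMonthQuery",  // placeholder query name
            crosstabMode: false,        // each row is one data point
            seriesInRows: false,        // with three columns: series, category, data
            timeSeries: true,           // continuous (date) base axis
            timeSeriesFormat: "%Y-%b",  // parses labels such as "2015-Nov"
            compatVersion: 2
        }
    }));
    dashboard.init();
});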
Making use of CCC in CDE
There are a lot of properties that will use a default value, and you can find out about them by looking at the documentation or by inspecting the code that is generated by CDE when you use the chart components. By looking at the console log of your browser, you should also be able to get some information about the properties being used by default and/or see whether you are using a property that does not fit your needs. The use of CCC charts in CDE is simpler, just because you may not need to code. I am only saying may because, to achieve quicker results, you may apply some code and make it easier to share properties among different charts or types of charts. To use a CCC chart, you just need to select the property that you want to change and set its value by using the dropdown or by typing the value.

The previous image shows a group of properties with the respective values on the right side. One of the best ways to get used to the CCC properties is to use the CCC page available as part of the Webdetails site: http://www.webdetails.pt/ctools/ccc. There you will find samples and the properties that are being used for each of the charts. You can use the dropdown to select different kinds of charts from all those that are available inside CCC. You also have the ability to change the properties and update the chart to check the result immediately. What I usually do, as it's easier and faster, is to change the properties there, check the results, and then apply the necessary values for each of the properties in the CCC charts inside the dashboards. In the samples, you will also find documentation about the properties, grouped by sections of the chart, and after that you will find the extension points. On the site, when you click on a property/option, you will be redirected to another page where you will find the documentation and how to use it.

Changing properties in the preExecution or postFetch
We are able to change the properties of the charts, as with any other component. Inside the preExecution, this refers to the component itself, so we will have access to the chart's main object, which we can manipulate to add, remove, and change options. For instance, you can apply the following code:

function() {
    var cdProps = {
        dotsVisible: true,
        plotFrame_strokeStyle: '#bbbbbb',
        colors: ['#005CA7', '#FFC20F', '#333333', '#68AC2D']
    };
    $.extend(true, this.chartDefinition, cdProps);
}

What we are doing is creating an object with all the properties that we want to add or change for the chart, and then extending the chartDefinition (where the properties or options are). That is what the jQuery extend function is doing.

Use the CCC website and make your life easier
This way of applying options makes it easier to set the properties. Just change or add the properties that you need, test it, and when you're happy with the result, copy them into the object that will extend/overwrite the chart options.

Just keep in mind that the properties you change directly in the editor will be overwritten by the ones defined in the preExecution, if they match each other, of course. Why is this important? Because not all the properties that you can apply to CCC are exposed in CDE, so you can use the preExecution to set those properties.
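The section above is titled "Changing properties in the preExecution or postFetch", but only preExecution was shown. Here is a minimal, hedged sketch of what a postFetch handler could look like; it assumes the standard CDF postFetch behavior (the function receives the CDA result object before the chart renders and must return it), and the column index used below is purely illustrative:

function(data) {
    // Assumed CDA result shape: data.resultset is an array of rows and
    // data.metadata describes the columns.
    // Illustrative tweak: drop rows whose measure (assumed here to be the
    // third column) is not positive, before the chart is rendered.
    data.resultset = data.resultset.filter(function(row) {
        return row[2] > 0;
    });
    return data;
}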
Handling the click event
One important thing about charts is that they allow interaction. CCC provides a way to handle some events in the chart, and click is one of those events. To make it work, we need to change two properties: clickable, which needs to be set to true, and clickAction, where we need to write a function with the code to be executed when a click happens. The function receives one argument that is usually referred to as a scene. The scene is an object that has a lot of information about the context where the event happened. From this object you will have access to vars, another object where we can find the series and the categories where the click happened. We can use the function to get the series/categories being clicked and perform a fireChange that can trigger updates on other components:

function(scene) {
    var series = "Series:" + scene.atoms.series.label;
    var category = "Category:" + scene.vars.category.label;
    var value = "Value:" + scene.vars.value.label;
    Logger.log(category + "&" + value);
    Logger.log(series);
}

In the previous code example, you can find the function that handles the click action for a CCC chart. When the click happens, the code is executed: the clicked series is taken from scene.atoms.series.label, the clicked category from scene.vars.category.label, and the value that crosses that series/category from scene.vars.value.value. This is valid for a crosstab, but you will not find the series when it's a non-crosstab.

You can think of a scene as describing one instance of visual representation. It is generally local to each panel or section of the chart, and it's represented by a group of variables that are organized hierarchically. Depending on the scene, it may contain one or many datums. And you must be asking: what exactly is a datum? A datum represents a row, so it contains values for multiple columns. We can also see from the example that we are referring to atoms, which hold at least a value, a label, and a key of a column. To get a better understanding of what I am talking about, set a breakpoint anywhere in the code of the previous function and explore the scene object. In the previous example, you would be able to access the category and series labels and the value, as follows:

Value: scene.vars.value.label or scene.getValue() (crosstab and non-crosstab)
Category: scene.vars.category.label or scene.getCategoryLabel() (crosstab and non-crosstab)
Series: scene.atoms.series.label or scene.getSeriesLabel() (crosstab only)

For instance, if you add the previous function code to a chart that is a crosstab where the categories are the years and the series are the territories, and you click on the chart, the output would be something like:

[info] WD: Category:2004 & Value:23630
[info] WD: Series:EMEA

This means that you clicked on the year 2004 for EMEA. EMEA sales for the year 2004 were 23,630.
If you replace the Logger functions with a fireChange as follows, you will be able to make use of the label/value of the clicked category to render other components and show some details about it:

this.dashboard.fireChange("parameter", scene.vars.category.label);

Internationalization of CCC charts
We already saw that the values coming from the database should not need to be translated. There are some ways in Pentaho to do this, but we may still need to set the title of a chart, and the title should also be internationalized. Another case is when you have dates where the month is represented by numbers in the base axis, but you want to display the month's abbreviated name. This name can also be translated to different languages, which is not hard.

For the title, sub-title, and legend, the way to do it is to use the instructions on how to set properties in the preExecution. First, you will need to define the properties files for the internationalization and set the properties/translations:

var cd = this.chartDefinition;
cd.title = this.dashboard.i18nSupport.prop('BOTTOMCHART.TITLE');

To change the title of the chart based on the language defined, we need to use a function; we can't use the title property in the chart editor because that only allows you to define a string, so you would not be able to use a JavaScript instruction to get the text. If you set the previous example code in the preExecution of the chart, then you will be able to.

It may also make sense to change not only the titles but also, for instance, to internationalize the month names. If you are getting data like 2004-02, this may correspond to a time series format of %Y-%m. If that's the case and you want to display the abbreviated month name, then you may use the baseAxisTickFormatter and the dateFormat function from the dashboard utilities, also known as Utils. The code to write inside the preExecution would be like:

var cd = this.chartDefinition;
cd.baseAxisTickFormatter = function(label) {
  // Parse a "2004-02" style label and return the abbreviated month name and year.
  return Utils.dateFormat(moment(label, 'YYYY-MM'), 'MMM/YYYY');
};

The preceding code uses the baseAxisTickFormatter, which allows you to write a function that receives an argument, identified in the code as label, because it stores the label of each one of the base axis ticks. We are using the dateFormat method and moment to format and return the abbreviated month name followed by the year. You can get information about the language being used by running the instruction moment.locale(); if you need to, you can change the language.

Format a base axis label based on the scale
When you are working with a time series chart, you may want to set a different format for the base axis labels. Let's suppose you have a chart that is listening to a time selector. If you select one year of data to be displayed on the chart, you are certainly not interested in seeing the minutes on the date label. However, if you want to display the last hour, the ticks of the base axis need to be presented in minutes. There is an extension point we can use to get a conditional format based on the scale of the base axis.
The extension point is baseAxisScale_tickFormatter, and it can be used as in the following code:

baseAxisScale_tickFormatter: function(value, dateTickPrecision) {
  switch(dateTickPrecision) {
    case pvc.time.intervals.y:
      return format_date_year_tick(value);
    case pvc.time.intervals.m:
      return format_date_month_tick(value);
    case pvc.time.intervals.H:
      return format_date_hour_tick(value);
    default:
      return format_date_default_tick(value);
  }
}

It accepts a function with two arguments, the value to be formatted and the tick precision, and should return the formatted label to be presented for each tick of the base axis. The previous code shows how the function is used: you can see a switch that, based on the base axis scale, applies a different format by calling a function. The functions used in the code are not pre-defined; we need to write the functions or code that creates the formatting. For example, such a function could use the Utils dateFormat function to return the formatted value to the chart. The following table shows the intervals that can be used when verifying which time interval is being displayed on the chart:

Interval  Description   Number representing the interval
y         Year          31536e6
m         Month         2592e6
d30       30 days       2592e6
d7        7 days        6048e5
d         Day           864e5
H         Hour          36e5
m         Minute        6e4
s         Second        1e3
ms        Milliseconds  1

Customizing tooltips
CCC provides the ability to change the default tooltip format using the tooltipFormat property. We can change it to look like the image on the right side below; you can compare it to the one on the left, which is the default one.

The default tooltip format might change depending on the chart type, but also on some options that you apply to the chart, mainly crosstabMode and seriesInRows. The property accepts a function that receives one argument, the scene, which has a structure similar to the one already covered for the click event. You should return the HTML to be shown on the dashboard when we hover over the chart. In the previous image, the chart on the left side shows the default tooltip, and the one on the right shows a different tooltip. That's because the following code was applied:

tooltipFormat: function(scene) {
  var year = scene.atoms.series.label;
  var territory = scene.atoms.category.value;
  var sales = Utils.numberFormat(scene.vars.value.value, "#.00A");
  var html = '<html>' +
    '<div>Sales for ' + year + ' at ' + territory + ': ' + sales + '</div>' +
  '</html>';
  return html;
}

The code is pretty self-explanatory. First we set some variables, such as the year, territory, and sales value, which we need to present inside the tooltip. As in the click event, we get the labels/values from the scene, which might depend on the properties we set for the chart. For the sales value, we also abbreviate it, using two decimal places. And last, we build the HTML to be displayed when we hover over the chart.

You can also change the base axis tooltip
Like the tooltip shown when hovering over the values represented in the chart, we can also customize the baseAxisTooltip; just don't forget that baseAxisTooltipVisible must be set to true (the default value). Getting the values to show will be pretty similar.

It can get a bit more complex, though not much more, when we also want, for instance, to display the total value of sales for one year or for a territory. Based on that, we could also present the percentage relative to the total.
We should use the property as explained earlier. The previous image is one example of how we can customize a tooltip. In this case, we are showing the value, but also the percentage that the hovered-over territory represents (as a percentage across all years) and the percentage for the hovered-over year (as a percentage across all territories):

tooltipFormat: function(scene) {
  var year = scene.getSeriesLabel();
  var territory = scene.getCategoryLabel();
  var value = scene.getValue();
  var sales = Utils.numberFormat(value, "#.00A");
  var totals = {};
  _.each(scene.chart().data._datums, function(element) {
    var value = element.atoms.value.value;
    totals[element.atoms.category.label] =
      (totals[element.atoms.category.label] || 0) + value;
    totals[element.atoms.series.label] =
      (totals[element.atoms.series.label] || 0) + value;
  });
  var categoryPerc = Utils.numberFormat(value/totals[territory], "0.0%");
  var seriesPerc = Utils.numberFormat(value/totals[year], "0.0%");
  var html = '<html>' +
    '<div class="value">' + sales + '</div>' +
    '<div class="dValue">Sales for ' + territory + ' in ' + year + '</div>' +
    '<div class="bar">' +
      '<div class="pPerc">' + categoryPerc + ' of ' + territory + '</div>' +
      '<div class="partialBar" style="width:' + categoryPerc + '"></div>' +
    '</div>' +
    '<div class="bar">' +
      '<div class="pPerc">' + seriesPerc + ' of ' + year + '</div>' +
      '<div class="partialBar" style="width:' + seriesPerc + '"></div>' +
    '</div>' +
  '</html>';
  return html;
}

The first lines of the code are pretty similar, except that we are using scene.getSeriesLabel() in place of scene.atoms.series.label. They do the same thing; they are just different ways to get the values/labels. Then come the total calculations, computed by iterating over all the elements of scene.chart().data._datums, which returns the logical/relational table: a combination of territory, year, and value. The last part just builds the HTML with all the values and labels that we already got from the scene. There are multiple ways to get the values you need; to customize the tooltip, you just need to explore the hierarchical structure of the scene and get used to it. The image also presents a different style, and that is done using CSS. You can add CSS to your dashboard and change the style of the tooltip, not just the format.

Styling tooltips
When we want to style a tooltip, we may want to use the developer tools to check the classes or names and the CSS properties already applied, but it's hard because the popup does not stay still. We can change the tooltipDelayOut property and increase its default value from 80 to 1000 or more, depending on the time you need. When you want to apply some styles to the tooltips of a particular chart, you can do so by setting a CSS class on the tooltip. For that you should use the tooltipClassName property and set the class name to be added and later used in the CSS.

Summary
In this article, we provided a quick overview of how to use CCC in CDF and CDE dashboards and showed you what kinds of charts are available. We covered some of the base options as well as some advanced options that you might use to get a more customized visualization.

Resources for Article:
Further resources on this subject:
Diving into OOP Principles [article]
Python Scripting Essentials [article]
Building a Puppet Module Skeleton [article]


Scraping the Data

Packt
21 Sep 2015
18 min read
In this article by Richard Lawson, author of the book Web Scraping with Python, we will first cover a browser extension called Firebug Lite to examine a web page, which you may already be familiar with if you have a web development background. Then, we will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup, and lxml. Finally, the article will conclude with a comparison of these three scraping alternatives.

(For more resources related to this topic, see here.)

Analyzing a web page
To understand how a web page is structured, we can try examining the source code. In most web browsers, the source code of a web page can be viewed by right-clicking on the page and selecting the View page source option. The data we are interested in is found in this part of the HTML:

<table>
<tr id="places_national_flag__row"><td class="w2p_fl"><label for="places_national_flag" id="places_national_flag__label">National Flag: </label></td><td class="w2p_fw"><img src="/places/static/images/flags/gb.png" /></td><td class="w2p_fc"></td></tr>
…
<tr id="places_neighbours__row"><td class="w2p_fl"><label for="places_neighbours" id="places_neighbours__label">Neighbours: </label></td><td class="w2p_fw"><div><a href="/iso/IE">IE </a></div></td><td class="w2p_fc"></td></tr>
</table>

This lack of whitespace and formatting is not an issue for a web browser to interpret, but it is difficult for us. To help us interpret this table, we will use the Firebug Lite extension, which is available for all web browsers at https://getfirebug.com/firebuglite. Firefox users can install the full Firebug extension if preferred, but the features we will use here are included in the Lite version. Now, with Firebug Lite installed, we can right-click on the part of the web page we are interested in scraping and select Inspect with Firebug Lite from the context menu. This opens a panel showing the surrounding HTML hierarchy of the selected element. In the preceding screenshot, the country attribute was clicked on, and the Firebug panel makes it clear that the country area figure is included within a <td> element of class w2p_fw, which is the child of a <tr> element of ID places_area__row. We now have all the information needed to scrape the area data.

Three approaches to scrape a web page
Now that we understand the structure of this web page, we will investigate three different approaches to scraping its data: first with regular expressions, then with the popular Beautiful Soup module, and finally with the powerful lxml module.

Regular expressions
If you are unfamiliar with regular expressions or need a reminder, there is a thorough overview available at https://docs.python.org/2/howto/regex.html.
To scrape the area using regular expressions, we will first try matching the contents of the <td> element, as follows:

>>> import re
>>> url = 'http://example.webscraping.com/view/United Kingdom-239'
>>> html = download(url)
>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)
['<img src="/places/static/images/flags/gb.png" />',
 '244,820 square kilometres',
 '62,348,447',
 'GB',
 'United Kingdom',
 'London',
 '<a href="/continent/EU">EU</a>',
 '.uk',
 'GBP',
 'Pound',
 '44',
 '@# #@@|@## #@@|@@# #@@|@@## #@@|@#@ #@@|@@#@ #@@|GIR0AA',
 '^(([A-Z]\d{2}[A-Z]{2})|([A-Z]\d{3}[A-Z]{2})|([A-Z]{2}\d{2}[A-Z]{2})|([A-Z]{2}\d{3}[A-Z]{2})|([A-Z]\d[A-Z]\d[A-Z]{2})|([A-Z]{2}\d[A-Z]\d[A-Z]{2})|(GIR0AA))$',
 'en-GB,cy-GB,gd',
 '<div><a href="/iso/IE">IE </a></div>']

This result shows that the <td class="w2p_fw"> tag is used for multiple country attributes. To isolate the area, we can select the second element, as follows:

>>> re.findall('<td class="w2p_fw">(.*?)</td>', html)[1]
'244,820 square kilometres'

This solution works but could easily fail if the web page is updated. Consider what happens if the website is updated and this data is no longer available in the second table row. If we just need to scrape the data now, future changes can be ignored. However, if we want to rescrape this data in the future, we want our solution to be as robust against layout changes as possible. To make this regular expression more robust, we can include the parent <tr> element, which has an ID, so it ought to be unique:

>>> re.findall('<tr id="places_area__row"><td class="w2p_fl"><label for="places_area" id="places_area__label">Area: </label></td><td class="w2p_fw">(.*?)</td>', html)
['244,820 square kilometres']

This iteration is better; however, there are many other ways the web page could be updated that would still break the regular expression. For example, double quotation marks might be changed to single ones, extra space could be added between the <td> tags, or the area label could be changed. Here is an improved version that tries to support these various possibilities:

>>> re.findall('<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)[0]
'244,820 square kilometres'

This regular expression is more future-proof, but it is difficult to construct and is becoming unreadable. Also, there are still other minor layout changes that would break it, such as if a title attribute were added to the <td> tag. From this example, it is clear that regular expressions provide a simple way to scrape data but are too brittle and will easily break when a web page is updated. Fortunately, there are better solutions.

Beautiful Soup
Beautiful Soup is a popular library that parses a web page and provides a convenient interface to navigate its content. If you do not already have it installed, the latest version can be installed using this command:

pip install beautifulsoup4

The first step with Beautiful Soup is to parse the downloaded HTML into a soup document. Most web pages do not contain perfectly valid HTML, and Beautiful Soup needs to decide what is intended. For example, consider this simple web page containing a list with missing attribute quotes and closing tags:

<ul class=country>
<li>Area
<li>Population
</ul>

If the Population item is interpreted as a child of the Area item instead of the list, we could get unexpected results when scraping.
Let us see how Beautiful Soup handles this:

>>> from bs4 import BeautifulSoup
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> # parse the HTML
>>> soup = BeautifulSoup(broken_html, 'html.parser')
>>> fixed_html = soup.prettify()
>>> print fixed_html
<html>
 <body>
  <ul class="country">
   <li>Area</li>
   <li>Population</li>
  </ul>
 </body>
</html>

Here, BeautifulSoup was able to correctly interpret the missing attribute quotes and closing tags, as well as add the <html> and <body> tags to form a complete HTML document. Now, we can navigate to the elements we want using the find() and find_all() methods:

>>> ul = soup.find('ul', attrs={'class':'country'})
>>> ul.find('li') # returns just the first match
<li>Area</li>
>>> ul.find_all('li') # returns all matches
[<li>Area</li>, <li>Population</li>]

Beautiful Soup overview
Here are the common methods and parameters you will use when scraping web pages with Beautiful Soup:
BeautifulSoup(markup, builder): This method creates the soup object. The markup parameter can be a string or file object, and builder is the library that parses the markup parameter.
find_all(name, attrs, text, **kwargs): This method returns a list of elements matching the given tag name, dictionary of attributes, and text. The contents of kwargs are used to match attributes.
find(name, attrs, text, **kwargs): This method is the same as find_all(), except that it returns only the first match. If no element matches, it returns None.
prettify(): This method returns the parsed HTML in an easy-to-read format with indentation and line breaks.
For a full list of available methods and parameters, the official documentation is available at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Now, using these techniques, here is a full example to extract the area from our example country page:

>>> from bs4 import BeautifulSoup
>>> url = 'http://example.webscraping.com/places/view/United-Kingdom-239'
>>> html = download(url)
>>> soup = BeautifulSoup(html)
>>> # locate the area row
>>> tr = soup.find(attrs={'id':'places_area__row'})
>>> td = tr.find(attrs={'class':'w2p_fw'}) # locate the area tag
>>> area = td.text # extract the text from this tag
>>> print area
244,820 square kilometres

This code is more verbose than the regular expression code but easier to construct and understand. Also, we no longer need to worry about problems with minor layout changes, such as extra whitespace or tag attributes.

Lxml
Lxml is a Python wrapper on top of the libxml2 XML parsing library written in C, which makes it faster than Beautiful Soup but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html. As with Beautiful Soup, the first step is parsing the potentially invalid HTML into a consistent format. Here is an example of parsing the same broken HTML:

>>> import lxml.html
>>> broken_html = '<ul class=country><li>Area<li>Population</ul>'
>>> tree = lxml.html.fromstring(broken_html) # parse the HTML
>>> fixed_html = lxml.html.tostring(tree, pretty_print=True)
>>> print fixed_html
<ul class="country">
<li>Area</li>
<li>Population</li>
</ul>

As with BeautifulSoup, lxml was able to correctly parse the missing attribute quotes and closing tags, although it did not add the <html> and <body> tags. After parsing the input, lxml has a number of different options to select elements, such as XPath selectors and a find() method similar to Beautiful Soup.
Instead, we will use CSS selectors here and in future examples, because they are more compact. Also, some readers will already be familiar with them from their experience with jQuery selectors. Here is an example using the lxml CSS selectors to extract the area data:

>>> tree = lxml.html.fromstring(html)
>>> td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
>>> area = td.text_content()
>>> print area
244,820 square kilometres

The key line is the one with the CSS selector. It finds a table row element with the places_area__row ID, and then selects the child table data tag with the w2p_fw class.

CSS selectors
CSS selectors are patterns used for selecting elements. Here are some examples of common selectors you will need:
Select any tag: *
Select by tag <a>: a
Select by class of "link": .link
Select by tag <a> with class "link": a.link
Select by tag <a> with ID "home": a#home
Select by child <span> of tag <a>: a > span
Select by descendant <span> of tag <a>: a span
Select by tag <a> with attribute title of "Home": a[title=Home]
The CSS3 specification was produced by the W3C and is available for viewing at http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. Lxml implements most of CSS3, and details on unsupported features are available at https://pythonhosted.org/cssselect/#supported-selectors. Note that, internally, lxml converts the CSS selectors into equivalent XPath.

Comparing performance
To help evaluate the trade-offs of the three scraping approaches described in this article, it helps to compare their relative efficiency. Typically, a scraper extracts multiple fields from a web page. So, for a more realistic comparison, we will implement extended versions of each scraper that extract all the available data from a country's web page. To get started, we need to return to Firebug to check the format of the other country features. Firebug shows that each table row has an ID starting with places_ and ending with __row, and that the country data is contained within these rows in the same format as the earlier area example.
Here are implementations that use this information to extract all of the available country data:

FIELDS = ('area', 'population', 'iso', 'country', 'capital', 'continent',
          'tld', 'currency_code', 'currency_name', 'phone',
          'postal_code_format', 'postal_code_regex', 'languages', 'neighbours')

import re
def re_scraper(html):
    results = {}
    for field in FIELDS:
        results[field] = re.search('<tr id="places_%s__row">.*?<td class="w2p_fw">(.*?)</td>' % field, html).groups()[0]
    return results

from bs4 import BeautifulSoup
def bs_scraper(html):
    soup = BeautifulSoup(html, 'html.parser')
    results = {}
    for field in FIELDS:
        results[field] = soup.find('table').find('tr', id='places_%s__row' % field).find('td', class_='w2p_fw').text
    return results

import lxml.html
def lxml_scraper(html):
    tree = lxml.html.fromstring(html)
    results = {}
    for field in FIELDS:
        results[field] = tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content()
    return results

Scraping results
Now that we have complete implementations for each scraper, we will test their relative performance with this snippet:

import time
NUM_ITERATIONS = 1000 # number of times to test each scraper
html = download('http://example.webscraping.com/places/view/United-Kingdom-239')
for name, scraper in [('Regular expressions', re_scraper),
                      ('BeautifulSoup', bs_scraper),
                      ('Lxml', lxml_scraper)]:
    # record start time of scrape
    start = time.time()
    for i in range(NUM_ITERATIONS):
        if scraper == re_scraper:
            re.purge()
        result = scraper(html)
        # check scraped result is as expected
        assert(result['area'] == '244,820 square kilometres')
    # record end time of scrape and output the total
    end = time.time()
    print '%s: %.2f seconds' % (name, end - start)

This example will run each scraper 1000 times, check whether the scraped results are as expected, and then print the total time taken. Note the line calling re.purge(); by default, the regular expression module caches searches, and this cache needs to be cleared to make a fair comparison with the other scraping approaches. Here are the results from running this script on my computer:

$ python performance.py
Regular expressions: 5.50 seconds
BeautifulSoup: 42.84 seconds
Lxml: 7.06 seconds

The results on your computer will quite likely be different because of the different hardware used. However, the relative difference between the approaches should be similar. The results show that Beautiful Soup is over six times slower than the other two approaches when used to scrape our example web page. This result could be anticipated, because lxml and the regular expression module were written in C, while BeautifulSoup is pure Python. An interesting fact is that lxml performed comparatively well against regular expressions, since lxml has the additional overhead of having to parse the input into its internal format before searching for elements. When scraping many features from a web page, this initial parsing overhead is reduced and lxml becomes even more competitive. It really is an amazing module!

Overview
The following table summarizes the advantages and disadvantages of each approach to scraping:

Scraping approach    Performance  Ease of use  Ease to install
Regular expressions  Fast         Hard         Easy (built-in module)
Beautiful Soup       Slow         Easy         Easy (pure Python)
Lxml                 Fast         Easy         Moderately difficult

If the bottleneck of your scraper is downloading web pages rather than extracting data, it would not be a problem to use a slower approach, such as Beautiful Soup.
Or, if you just need to scrape a small amount of data and want to avoid additional dependencies, regular expressions might be an appropriate choice. However, in general, lxml is the best choice for scraping, because it is fast and robust, while regular expressions and Beautiful Soup are only useful in certain niches.

Adding a scrape callback to the link crawler
Now that we know how to scrape the country data, we can integrate this into the link crawler. To allow reusing the same crawling code to scrape multiple websites, we will add a callback parameter to handle the scraping. A callback is a function that will be called after a certain event (in this case, after a web page has been downloaded). This scrape callback will take a url and html as parameters and optionally return a list of further URLs to crawl. Here is the implementation, which is simple in Python:

def link_crawler(..., scrape_callback=None):
    …
    links = []
    if scrape_callback:
        links.extend(scrape_callback(url, html) or [])
    …

The new code for the scrape callback is shown in the preceding snippet. Now, this crawler can be used to scrape multiple websites by customizing the function passed to scrape_callback. Here is a modified version of the lxml example scraper that can be used as the callback function:

def scrape_callback(url, html):
    if re.search('/view/', url):
        tree = lxml.html.fromstring(html)
        row = [tree.cssselect('table > tr#places_%s__row > td.w2p_fw' % field)[0].text_content() for field in FIELDS]
        print url, row

This callback function would scrape the country data and print it out. Usually, when scraping a website, we want to reuse the data, so we will extend this example to save the results to a CSV spreadsheet, as follows:

import csv

class ScrapeCallback:
    def __init__(self):
        self.writer = csv.writer(open('countries.csv', 'w'))
        self.fields = ('area', 'population', 'iso', 'country', 'capital',
                       'continent', 'tld', 'currency_code', 'currency_name',
                       'phone', 'postal_code_format', 'postal_code_regex',
                       'languages', 'neighbours')
        self.writer.writerow(self.fields)

    def __call__(self, url, html):
        if re.search('/view/', url):
            tree = lxml.html.fromstring(html)
            row = []
            for field in self.fields:
                row.append(tree.cssselect('table > tr#places_{}__row > td.w2p_fw'.format(field))[0].text_content())
            self.writer.writerow(row)

To build this callback, a class was used instead of a function so that the state of the csv writer could be maintained. This csv writer is instantiated in the constructor and then written to multiple times in the __call__ method. Note that __call__ is a special method that is invoked when an object is "called" as a function, which is how the scrape callback is used in the link crawler. This means that scrape_callback(url, html) is equivalent to calling scrape_callback.__call__(url, html). For further details on Python's special class methods, refer to https://docs.python.org/2/reference/datamodel.html#special-method-names. This code shows how to pass this callback to the link crawler:

link_crawler('http://example.webscraping.com/', '/(index|view)', max_depth=-1, scrape_callback=ScrapeCallback())

Now, when the crawler is run with this callback, it will save the results to a CSV file that can be viewed in an application such as Excel or LibreOffice. Success! We have completed our first working scraper.

Summary
In this article, we walked through a variety of ways to scrape data from a web page.
Summary

In this article, we walked through a variety of ways to scrape data from a web page. Regular expressions can be useful for a one-off scrape or to avoid the overhead of parsing the entire web page, and Beautiful Soup provides a high-level interface while avoiding any difficult dependencies. However, in general, lxml will be the best choice because of its speed and extensive functionality, so we will use it in future examples.

Resources for Article:

Further resources on this subject:
Scientific Computing APIs for Python [article]
Bizarre Python [article]
Optimization in Python [article]

Working with Spark’s graph processing library, GraphFrames

Pravin Dhandre
11 Jan 2018
12 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book by Rajanarayanan Thottuvaikkatumana titled, Apache Spark 2 for Beginners. The author presents a learners guide for python and scala developers to develop large-scale and distributed data processing applications in the business environment.[/box] In this post we will see how a Spark user can work with Spark’s most popular graph processing package, GraphFrames. Additionally explore how you can benefit from running queries and finding insightful patterns through graphs. The Spark GraphX library is the graph processing library that has the least programming language support. Scala is the only programming language supported by the Spark GraphX library. GraphFrames is a new graph processing library available as an external Spark package developed by Databricks, University of California, Berkeley, and Massachusetts Institute of Technology, built on top of Spark DataFrames. Since it is built on top of DataFrames, all the operations that can be done on DataFrames are potentially possible on GraphFrames, with support for programming languages such as Scala, Java, Python, and R with a uniform API. Since GraphFrames is built on top of DataFrames, the persistence of data, support for numerous data sources, and powerful graph queries in Spark SQL are additional benefits users get for free. Just like the Spark GraphX library, in GraphFrames the data is stored in vertices and edges. The vertices and edges use DataFrames as the data structure. The first use case covered in the beginning of this chapter is used again to elucidate GraphFrames-based graph processing. Please make a note that GraphFrames is an external Spark package. It has some incompatibility with Spark 2.0. Because of that, the following code snippets will not work with  park 2.0. They work with Spark 1.6. Refer to their website to check Spark 2.0 support. At the Scala REPL prompt of Spark 1.6, try the following statements. Since GraphFrames is an external Spark package, while bringing up the appropriate REPL, the library has to be imported and the following command is used in the terminal prompt to fire up the REPL and make sure that the library is loaded without any error messages: $ cd $SPARK_1.6__HOME $ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6 Ivy Default Cache set to: /Users/RajT/.ivy2/cache The jars for the packages stored in: /Users/RajT/.ivy2/jars :: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2- SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml graphframes#graphframes added as a dependency :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0 confs: [default] found graphframes#graphframes;0.1.0-spark1.6 in list :: resolution report :: resolve 153ms :: artifacts dl 2ms :: modules in use: graphframes#graphframes;0.1.0-spark1.6 from list in [default] --------------------------------------------------------------------- | | modules || artifacts | | conf | number| search|dwnlded|evicted|| number|dwnlded| --------------------------------------------------------------------- | default | 1 | 0 | 0 | 0 || 1 | 0 | --------------------------------------------------------------------- :: retrieving :: org.apache.spark#spark-submit-parent confs: [default] 0 artifacts copied, 1 already retrieved (0kB/5ms) 16/07/31 09:22:11 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val users = sqlContext.createDataFrame(List((1L, "Thomas"),(2L, "Krish"),(3L, "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: bigint, name: string]
scala> //Created a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List((1L, 2L, "Follows"),(1L, 2L, "Son"),(2L, 3L, "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, relationship: string]
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, name: string], e:[src: bigint, dst: bigint, relationship: string])
scala> // Vertices in the graph
scala> userGraph.vertices.show()
+---+------+
| id|  name|
+---+------+
|  1|Thomas|
|  2| Krish|
|  3|Mathew|
+---+------+
scala> // Edges in the graph
scala> userGraph.edges.show()
+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  1|  2|     Follows|
|  1|  2|         Son|
|  2|  3|     Follows|
+---+---+------------+
scala> //Number of edges in the graph
scala> val edgeCount = userGraph.edges.count()
edgeCount: Long = 3
scala> //Number of vertices in the graph
scala> val vertexCount = userGraph.vertices.count()
vertexCount: Long = 3
scala> //Number of edges coming to each of the vertex.
scala> userGraph.inDegrees.show()
+---+--------+
| id|inDegree|
+---+--------+
|  2|       2|
|  3|       1|
+---+--------+
scala> //Number of edges going out of each of the vertex.
scala> userGraph.outDegrees.show()
+---+---------+
| id|outDegree|
+---+---------+
|  1|        2|
|  2|        1|
+---+---------+
scala> //Total number of edges coming in and going out of each vertex.
scala> userGraph.degrees.show()
+---+------+
| id|degree|
+---+------+
|  1|     2|
|  2|     3|
|  3|     1|
+---+------+
scala> //Get the triplets of the graph
scala> userGraph.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> //Using the DataFrame API, apply filter and select only the needed edges
scala> val numFollows = userGraph.edges.filter("relationship = 'Follows'").count()
numFollows: Long = 2
scala> //Create an RDD of users containing tuple values with a mandatory Long and another String type as the property of the vertex
scala> val usersRDD: RDD[(Long, String)] = sc.parallelize(Array((1L, "Thomas"), (2L, "Krish"),(3L, "Mathew")))
usersRDD: org.apache.spark.rdd.RDD[(Long, String)] = ParallelCollectionRDD[54] at parallelize at <console>:35
scala> //Created an RDD of Edge type with String type as the property of the edge
scala> val userRelationshipsRDD: RDD[Edge[String]] = sc.parallelize(Array(Edge(1L, 2L, "Follows"), Edge(1L, 2L, "Son"),Edge(2L, 3L, "Follows")))
userRelationshipsRDD: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[55] at parallelize at <console>:35
scala> //Create a graph containing the vertex and edge RDDs as created before
scala> val userGraphXFromRDD = Graph(usersRDD, userRelationshipsRDD)
userGraphXFromRDD: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@77a3c614
scala> //Create the GraphFrame based graph from Spark GraphX based graph
scala> val userGraphFrameFromGraphX: GraphFrame = GraphFrame.fromGraphX(userGraphXFromRDD)
userGraphFrameFromGraphX: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, attr: string], e:[src: bigint, dst: bigint, attr: string])
scala> userGraphFrameFromGraphX.triplets.show()
+-------------+----------+----------+
|         edge|       src|       dst|
+-------------+----------+----------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|
|    [1,2,Son]|[1,Thomas]| [2,Krish]|
|[2,3,Follows]| [2,Krish]|[3,Mathew]|
+-------------+----------+----------+
scala> // Convert the GraphFrame based graph to a Spark GraphX based graph
scala> val userGraphXFromGraphFrame: Graph[Row, Row] = userGraphFrameFromGraphX.toGraphX
userGraphXFromGraphFrame: org.apache.spark.graphx.Graph[org.apache.spark.sql.Row,org.apache.spark.sql.Row] = org.apache.spark.graphx.impl.GraphImpl@238d6aa2

When creating DataFrames for a GraphFrame, the only thing to keep in mind is that there are some mandatory columns for the vertices and the edges. In the DataFrame for vertices, the id column is mandatory. In the DataFrame for edges, the src and dst columns are mandatory. Apart from that, any number of arbitrary columns can be stored with both the vertices and the edges of a GraphFrame. In the Spark GraphX library, the vertex identifier must be a long integer, but GraphFrames does not have any such limitation and any type is supported as the vertex identifier. Readers should already be familiar with DataFrames; any operation that can be done on a DataFrame can be done on the vertices and edges of a GraphFrame. All the graph processing algorithms supported by Spark GraphX are supported by GraphFrames as well.

The Python version of GraphFrames has fewer features. Since Python is not a supported programming language for the Spark GraphX library, GraphFrame-to-GraphX and GraphX-to-GraphFrame conversions are not supported in Python. Since readers are familiar with the creation of DataFrames in Spark using Python, the Python example is omitted here. Moreover, at the time of writing there were some pending defects in the GraphFrames API for Python, and not all the features demonstrated previously in Scala worked properly in Python.
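Although the Python example is omitted in the excerpt, the shape of the Python API is easy to show. The following is my own minimal sketch, not code from the book; it assumes a Spark 1.6 pyspark shell started with the same --packages graphframes:graphframes:0.1.0-spark1.6 option, and, as noted above, not every feature of this early Python binding behaved exactly like the Scala API:

# Launch with:
#   ./bin/pyspark --packages graphframes:graphframes:0.1.0-spark1.6
from graphframes import GraphFrame

# sqlContext is provided by the pyspark shell
users = sqlContext.createDataFrame(
    [(1, "Thomas"), (2, "Krish"), (3, "Mathew")], ["id", "name"])
userRelationships = sqlContext.createDataFrame(
    [(1, 2, "Follows"), (1, 2, "Son"), (2, 3, "Follows")],
    ["src", "dst", "relationship"])

userGraph = GraphFrame(users, userRelationships)
userGraph.vertices.show()       # the three users
userGraph.edges.show()          # the three relationships
userGraph.inDegrees.show()      # edges coming into each vertex
userGraph.outDegrees.show()     # edges going out of each vertex
userGraph.degrees.show()        # total edges touching each vertex
# Ordinary DataFrame operations apply, for example:
userGraph.edges.filter("relationship = 'Follows'").count()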
Understanding GraphFrames queries

The Spark GraphX library is an RDD-based graph processing library, but GraphFrames is a Spark DataFrame-based graph processing library that is available as an external package. Spark GraphX supports many graph processing algorithms, but GraphFrames supports not only graph processing algorithms but also graph queries. The major difference between the two is that graph processing algorithms are used to process the data hidden in a graph data structure, while graph queries are used to search for patterns in that data. In GraphFrame parlance, graph queries are also known as motif finding. This has tremendous applications in genetics and other biological sciences that deal with sequence motifs.

From a use case perspective, take the example of users following each other in a social media application. Users have relationships between them, and in the previous sections these relationships were modeled as graphs. In real-world use cases such graphs can become really huge, and if there is a need to find users with relationships between them in both directions, that need can be expressed as a pattern in a graph query, and such relationships can be found using simple programmatic constructs. The following demonstration models the relationships between users in a GraphFrame, and a pattern search is done using it.

At the Scala REPL prompt of Spark 1.6, try the following statements:

$ cd $SPARK_1.6_HOME
$ ./bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6
Ivy Default Cache set to: /Users/RajT/.ivy2/cache
The jars for the packages stored in: /Users/RajT/.ivy2/jars
:: loading settings :: url = jar:file:/Users/RajT/source-code/sparksource/spark-1.6.1/assembly/target/scala-2.10/spark-assembly-1.6.2-SNAPSHOT-hadoop2.2.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
   confs: [default]
   found graphframes#graphframes;0.1.0-spark1.6 in list
:: resolution report :: resolve 145ms :: artifacts dl 2ms
   :: modules in use:
   graphframes#graphframes;0.1.0-spark1.6 from list in [default]
   ---------------------------------------------------------------------
   |                  |            modules            ||   artifacts   |
   |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
   ---------------------------------------------------------------------
   |      default     |   1   |   0   |   0   |   0   ||   1   |   0   |
   ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
   confs: [default]
   0 artifacts copied, 1 already retrieved (0kB/5ms)
16/07/29 07:09:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_66)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.

scala> import org.graphframes._
import org.graphframes._
scala> import org.apache.spark.rdd.RDD
import org.apache.spark.rdd.RDD
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row
scala> import org.apache.spark.graphx._
import org.apache.spark.graphx._
scala> //Create a DataFrame of users containing tuple values with a mandatory String field as id and another String type as the property of the vertex. Here it can be seen that the vertex identifier is no longer a long integer.
scala> val users = sqlContext.createDataFrame(List(("1", "Thomas"),("2", "Krish"),("3", "Mathew"))).toDF("id", "name")
users: org.apache.spark.sql.DataFrame = [id: string, name: string]
scala> //Create a DataFrame for Edge with String type as the property of the edge
scala> val userRelationships = sqlContext.createDataFrame(List(("1", "2", "Follows"),("2", "1", "Follows"),("2", "3", "Follows"))).toDF("src", "dst", "relationship")
userRelationships: org.apache.spark.sql.DataFrame = [src: string, dst: string, relationship: string]
scala> //Create the GraphFrame
scala> val userGraph = GraphFrame(users, userRelationships)
userGraph: org.graphframes.GraphFrame = GraphFrame(v:[id: string, name: string], e:[src: string, dst: string, relationship: string])
scala> // Search for pairs of users who are following each other
scala> // In other words, the query can be read like this: find the list of users having a pattern such that user u1 is related to user u2 using the edge e1 and user u2 is related to user u1 using the edge e2. When a query is formed like this, the result will be listed with the columns u1, u2, e1 and e2. When modelling real-world use cases, more meaningful variable names suitable for the use case can be used.
scala> val graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery: org.apache.spark.sql.DataFrame = [e1: struct<src:string,dst:string,relationship:string>, u1: struct<id:string,name:string>, u2: struct<id:string,name:string>, e2: struct<src:string,dst:string,relationship:string>]
scala> graphQuery.show()
+-------------+----------+----------+-------------+
|           e1|        u1|        u2|           e2|
+-------------+----------+----------+-------------+
|[1,2,Follows]|[1,Thomas]| [2,Krish]|[2,1,Follows]|
|[2,1,Follows]| [2,Krish]|[1,Thomas]|[1,2,Follows]|
+-------------+----------+----------+-------------+

Note that the columns in the graph query result are formed from the elements given in the search pattern, and there is no limit to the way the patterns can be formed. Note also the data type of the graph query result: it is a DataFrame object, which brings great flexibility in processing the query results using the familiar Spark SQL library. The biggest limitation of the Spark GraphX library is that its API is not supported in popular programming languages such as Python and R. Since GraphFrames is a DataFrame-based library, once it matures it will enable graph processing in all the programming languages supported by DataFrames. This Spark external package is definitely a potential candidate to be included as part of Spark itself.

To know more about the design and development of data processing applications using Spark and the family of libraries built on top of it, do check out the book Apache Spark 2 for Beginners.
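For comparison, here is a minimal sketch of the same mutual-followers motif query through the Python API. This is my own sketch rather than code from the book; it assumes the same pyspark 1.6 shell with the graphframes 0.1.0 package loaded, and it relies on the fact that the query result is a regular DataFrame that can be filtered with ordinary Spark SQL expressions:

from graphframes import GraphFrame

users = sqlContext.createDataFrame(
    [("1", "Thomas"), ("2", "Krish"), ("3", "Mathew")], ["id", "name"])
userRelationships = sqlContext.createDataFrame(
    [("1", "2", "Follows"), ("2", "1", "Follows"), ("2", "3", "Follows")],
    ["src", "dst", "relationship"])
userGraph = GraphFrame(users, userRelationships)

# Find pairs of users connected by edges in both directions
graphQuery = userGraph.find("(u1)-[e1]->(u2); (u2)-[e2]->(u1)")
graphQuery.show()

# The result is a DataFrame, so Spark SQL style filtering applies,
# for example keeping only mutual 'Follows' relationships:
graphQuery.filter("e1.relationship = 'Follows' AND e2.relationship = 'Follows'").show()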

Ridge Regression

Packt
16 Dec 2014
9 min read
In this article by Patrick R. Nicolas, the author of the book Scala for Machine Learning, we will cover the basics of ridge regression. The purpose of regression is to minimize a loss function, the residual sum of squares (RSS) being the one commonly used. The problem of overfitting can be addressed by adding a penalty term to the loss function. The penalty term is an element of the larger concept of regularization.

Ln roughness penalty

Regularization consists of adding a penalty function J(w) to the loss function (or RSS in the case of a regressive classifier) in order to prevent the model parameters (or weights) from reaching high values. A model that fits a training set very well tends to have many feature variables with relatively large weights. This process is known as shrinkage. Practically, shrinkage consists of adding a function of the model parameters to the loss function:

    loss(w) = RSS(w) + λ J(w),  with λ > 0

The penalty function is completely independent from the training set {x, y}. The penalty term is usually expressed as a power of the norm of the model parameters (or weights) w_d. For a model of D dimensions, the generic Lp-norm is defined as follows:

    ||w||_p = ( Σ_d |w_d|^p )^(1/p),  d = 1, …, D

Notation: Regularization applies to the parameters or weights associated with an observation. In order to be consistent with our notation, w0 being the intercept value, the regularization applies to the parameters w1 … wd.

The two most commonly used penalty functions for regularization are L1 and L2.

Regularization in machine learning

The regularization technique is not specific to linear or logistic regression. Any algorithm that minimizes the residual sum of squares, such as a support vector machine or a feed-forward neural network, can be regularized by adding a roughness penalty function to the RSS. The L1 regularization applied to linear regression is known as the Lasso regularization. Ridge regression is a linear regression that uses the L2 regularization penalty.

You may wonder which regularization makes sense for a given training set. In a nutshell, L2 and L1 regularization differ in terms of computation efficiency, estimation, and feature selection (refer to the 13.3 L1 regularization: basics section in the book Machine Learning: A Probabilistic Perspective, and the Feature selection, L1 vs. L2 regularization, and rotational invariance paper available at http://www.machinelearning.org/proceedings/icml2004/papers/354.pdf). The main differences between the two regularizations are as follows:

Model estimation: L1 generates a sparser estimation of the regression parameters than L2. For large non-sparse datasets, L2 has a smaller estimation error than L1.

Feature selection: L1 is more effective than L2 in reducing the regression weights for features with high values. Therefore, L1 is a reliable feature selection tool.

Overfitting: Both L1 and L2 reduce the impact of overfitting. However, L1 has a significant advantage in overcoming overfitting (or excessive complexity of a model), for the same reason that it is more appropriate for selecting features.

Computation: L2 is conducive to a more efficient computation model. The sum of the loss function and the L2 penalty, w², is a continuous and differentiable function for which the first and second derivatives can be computed (convex minimization). The L1 term is a summation of |wi| and is therefore not differentiable.

Terminology: The ridge regression is sometimes called the penalized least squares regression.
The L2 regularization is also known as weight decay.

Let's implement the ridge regression, and then evaluate the impact of the L2-norm penalty factor.

Ridge regression

The ridge regression is a multivariate linear regression with an L2-norm penalty term; its weights minimize the following:

    Σ_i (y_i − w0 − Σ_d w_d x_id)² + λ Σ_d w_d²

The computation of the ridge regression parameters requires the resolution of a system of linear equations similar to the linear regression. The matrix representation of the ridge regression closed form is as follows:

    w = (Xᵀ X + λ I)⁻¹ Xᵀ y

where I is the identity matrix; the regularized matrix Xᵀ X + λ I is solved using the QR decomposition, as shown in the implementation below.

Implementation

The implementation of the ridge regression adds the L2 regularization term to the multiple linear regression computation of the Apache Commons Math library. The methods of RidgeRegression have the same signatures as their ordinary least squares counterparts. However, the class has to inherit the abstract base class AbstractMultipleLinearRegression in Apache Commons Math and override the generation of the QR decomposition to include the penalty term, as shown in the following code:

class RidgeRegression[T <% Double](val xt: XTSeries[Array[T]],
                                   val y: DblVector,
                                   val lambda: Double)
      extends AbstractMultipleLinearRegression
      with PipeOperator[Array[T], Double] {

   private var qr: QRDecomposition = null
   private[this] val model: Option[RegressionModel] = …
   …
}

Besides the input time series xt and the labels y, the ridge regression requires the lambda factor of the L2 penalty term. The instantiation of the class trains the model. The steps to create the ridge regression model are as follows:

1. Extract the Q and R matrices for the input values, newXSampleData (line 1).
2. Compute the weights using calculateBeta defined in the base class (line 2).
3. Return the tuple of regression weights calculateBeta and the residuals calculateResiduals.

private val model: Option[(DblVector, Double)] = {
  this.newXSampleData(xt.toDblMatrix) //1
  newYSampleData(y)
  val _rss = calculateResiduals.toArray.map(x => x*x).sum
  val wRss = (calculateBeta.toArray, _rss) //2
  Some(RegressionModel(wRss._1, wRss._2))
}

The QR decomposition in the AbstractMultipleLinearRegression base class does not include the penalty term (line 3); the identity matrix scaled by the lambda factor has to be added to the diagonal of the matrix to be decomposed (line 4):

override protected def newXSampleData(x: DblMatrix): Unit = {
  super.newXSampleData(x)   //3
  val xtx: RealMatrix = getX
  val nFeatures = xt(0).size
  Range(0, nFeatures).foreach(i => xtx.setEntry(i, i, xtx.getEntry(i,i) + lambda)) //4
  qr = new QRDecomposition(xtx)
}

The regression weights are computed by resolving the system of linear equations using substitution on the QR matrices. It overrides the calculateBeta function from the base class:

override protected def calculateBeta: RealVector = qr.getSolver().solve(getY())
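Independently of the Scala implementation above, the closed form is easy to sanity-check numerically. The following is a small NumPy sketch of my own, not code from the book; unlike the book's implementation it regularizes every weight (there is no separate intercept w0), and it is meant only to illustrate how the fitted weights shrink as lambda grows:

import numpy as np

def ridge_weights(X, y, lam):
    # closed form: w = (X^T X + lam * I)^-1 X^T y, solved as a linear system
    n_features = X.shape[1]
    A = X.T.dot(X) + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T.dot(y))

# toy data: two features plus a small amount of noise
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.randn(100)

for lam in (0.0, 1.0, 10.0, 100.0):
    print '%6.1f %s' % (lam, ridge_weights(X, y, lam))
# As lam increases, the fitted weights are pulled toward zero (shrinkage);
# lam = 0 reproduces the ordinary least squares solution.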
Test case

The objective of the test case is to identify the impact of the L2 penalization on the RSS value, and then compare the predicted values with the original values. Let's consider a first test case: regressing the daily price variation of the Copper ETF (symbol: CU) on the stock's daily volatility and volume as features. The implementation of the extraction of observations is identical to that of the least squares regression:

val src = DataSource(path, true, true, 1)
val price = src |> YahooFinancials.adjClose
val volatility = src |> YahooFinancials.volatility
val volume = src |> YahooFinancials.volume //1

val _price = price.get.toArray
val deltaPrice = XTSeries[Double](_price
                               .drop(1)
                               .zip(_price.take(_price.size -1))
                               .map( z => z._1 - z._2)) //2
val data = volatility.get
                     .zip(volume.get)
                     .map(z => Array[Double](z._1, z._2)) //3
val features = XTSeries[DblVector](data.take(data.size-1))
val regression = new RidgeRegression[Double](features, deltaPrice, lambda) //4

regression.rss match {
  case Some(rss) => Display.show(rss, logger) //5
  ….

The observed data, the ETF daily prices, and the features (volatility and volume) are extracted from the source src (line 1). The daily price change, deltaPrice, is computed using a combination of the Scala take and drop methods (line 2). The features vector is created by zipping volatility and volume (line 3). The model is created by instantiating the RidgeRegression class (line 4). The RSS value, rss, is finally displayed (line 5).

The RSS value is plotted for values of lambda <= 1.0 in the following graph:

Graph of RSS versus Lambda for Copper ETF

The residual sum of squares decreases as λ increases, and the curve seems to reach a minimum around λ = 1. The case λ = 0 corresponds to the least squares regression. Next, let's plot the RSS value for λ varying between 1 and 100:

Graph of RSS versus large values of Lambda for Copper ETF

This time around, RSS increases with λ before reaching a maximum for λ > 60. This behavior is consistent with other findings (refer to Lecture 5: Model selection and assessment, a lecture by H. Bravo and R. Irizarry from the Department of Computer Science, University of Maryland, 2010, available at http://www.cbcb.umd.edu/~hcorrada/PracticalML/pdf/lectures/selection.pdf). As λ increases, overfitting gets more expensive, and therefore the RSS value increases.

The regression weights can simply be output as follows:

regression.weights.get

Let's plot the predicted price variation of the Copper ETF using the ridge regression with different values of lambda (λ):

Graph of ridge regression on Copper ETF price variation with variable Lambda

The original price variation of the Copper ETF, Δ = price(t+1) - price(t), is plotted as the λ = 0 case. The predicted values for λ = 0.8 are very similar to the original data and follow its pattern, with large variations (peaks and troughs) reduced. The predicted values for λ = 5 correspond to a smoothed dataset: the pattern of the original data is preserved, but the magnitude of the price variation is significantly reduced. The reader is invited to apply the more elaborate K-fold validation routine and compute precision, recall, and the F1 measure to confirm the findings.

Summary

The ridge regression is a powerful alternative to the more common least squares regression because it reduces the risk of overfitting. Contrary to the Naïve Bayes classifiers, it does not require conditional independence of the model features.

Resources for Article:

Further resources on this subject:
Differences in style between Java and Scala code [Article]
Dependency Management in SBT [Article]
Introduction to MapReduce [Article]