
How-To Tutorials - Data

ADO.NET Entity Framework

Packt
23 Oct 2009
6 min read
Creating an Entity Data Model

You can create the ADO.NET Entity Data Model in one of two ways:

- Use the ADO.NET Entity Data Model Designer
- Use the command line Entity Data Model generator called EdmGen.exe

We will first take a look at how to design an Entity Data Model using the ADO.NET Entity Data Model Designer, a Visual Studio wizard that is enabled after you install the ADO.NET Entity Framework and its tools. It provides a graphical interface that you can use to generate an Entity Data Model.

Creating the Payroll Entity Data Model using the ADO.NET Entity Data Model Designer

Here are the tables of the Payroll database that we will use to generate the data model:

- Employee
- Designation
- Department
- Salary
- ProvidentFund

To create an entity data model using the ADO.NET Entity Data Model Designer, follow these simple steps:

1. Open Visual Studio.NET, create a solution for a new web application project, and save it with a name.
2. Switch to the Solution Explorer, right-click, and click on Add New Item.
3. Select ADO.NET Entity Data Model from the list of templates displayed.
4. Name the Entity Data Model PayrollModel and click on Add.
5. Select Generate from database in the Entity Data Model Wizard. Note that you can also use the Empty model template to create the Entity Data Model yourself, building the Entity Types and their relationships manually by dragging items from the toolbox. We will not use this template in our discussion here, so let's get to the next step.
6. Click on Next in the Entity Data Model Wizard. A dialog box appears and prompts you to choose your connection.
7. Click on New Connection and specify the connection properties and parameters. We will use a dot to specify the database server name; this implies that we will be using the database server of the localhost, which is the current system in use. After you specify the necessary user name, password, and server name, you can test your connection using the Test Connection button. When you do so, the message Test connection succeeded is displayed in a message box.
8. Click on OK in the Test connection dialog box. Note the entity connection string that is generated automatically. This connection string will be saved in the connectionStrings section of your application's web.config file and will look like this:

<connectionStrings>
  <add name="PayrollEntities"
       connectionString="metadata=res://*;provider=System.Data.SqlClient;provider connection string=&quot;Data Source=.;Initial Catalog=Payroll;User ID=sa;Password=joydip1@3;MultipleActiveResultSets=True&quot;"
       providerName="System.Data.EntityClient" />
</connectionStrings>

9. Click on Next, expand the Tables node, and select the database objects that you require in the Entity Data Model to be generated.
10. Click on Finish to generate the Entity Data Model.
While the Entity Data Model is being generated, the Output Window reports that the Entity Data Model has been generated and saved in a file named PayrollModel.edmx.

We are done creating our first Entity Data Model using the ADO.NET Entity Data Model Designer tool. When you open the Payroll Entity Data Model that we just created in the designer view, note how the Entity Types in the model are related to one another. These relationships have been generated automatically by the Entity Data Model Designer based on the relationships between the tables of the Payroll database.

In the next section, we will learn how to create an Entity Data Model using the EdmGen.exe command line tool.

Creating the Payroll Data Model using the EdmGen tool

We will now take a look at how to create a data model using the Entity Data Model generation tool called EdmGen. The EdmGen.exe command line tool can be used to do one or more of the following:

- Generate the .csdl, .msl, and .ssdl files that make up the Entity Data Model
- Generate object classes from a .csdl file
- Validate an Entity Data Model

The EdmGen.exe command line tool generates the Entity Data Model as a set of three files: .csdl, .msl, and .ssdl. If you have used the ADO.NET Entity Data Model Designer to generate your Entity Data Model, the generated .edmx file contains the CSDL, MSL, and SSDL sections, so you have a single .edmx file that bundles all of these sections. On the other hand, if you use the EdmGen.exe tool to generate the Entity Data Model, you get three distinct files with the .csdl, .msl, and .ssdl extensions.

Here is a list of the major options of the EdmGen.exe command line tool:

Option                                   Description
/help                                    Displays help on all the possible options of this tool. The short form is /?.
/language:CSharp                         Generates code using the C# language.
/language:VB                             Generates code using the VB language.
/provider:<string>                       Specifies the name of the ADO.NET data provider that you would like to use.
/connectionstring:<connection string>    Specifies the connection string to be used to connect to the database.
/namespace:<string>                      Specifies the name of the namespace.
/mode:FullGeneration                     Generates the CSDL, MSL, and SSDL objects from the database schema.
/mode:EntityClassGeneration              Generates the entity classes from a given CSDL file.
/mode:FromSsdlGeneration                 Generates the MSL, CSDL, and entity classes from a given SSDL file.
/mode:ValidateArtifacts                  Validates the CSDL, SSDL, and MSL files.
/mode:ViewGeneration                     Generates mapping views from the CSDL, SSDL, and MSL files.
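As a rough illustration (this invocation is not part of the original article), a full generation run against the local Payroll database could be put together from the options above; the connection string and namespace are assumptions, and, depending on the tool version, additional options (for example, one that names the output files) may also be required:

EdmGen.exe /mode:FullGeneration
           /provider:System.Data.SqlClient
           /connectionstring:"Data Source=.;Initial Catalog=Payroll;Integrated Security=True"
           /namespace:PayrollModel
           /language:CSharp

Running this against the Payroll schema should produce the .csdl, .msl, and .ssdl files described earlier, which can then be referenced from your application in place of the designer-generated .edmx file.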

What is New in 12c

Packt
30 Sep 2013
23 min read
(For more resources related to this topic, see here.)

Oracle Database 12c has introduced many new features and enhancements for backup and recovery. This article will introduce you to some of them, and you will have the opportunity to learn in more detail how they can be used in real-life situations. But I cannot start talking about Oracle 12c without first talking about a revolutionary new concept introduced with this version of the database product, called the Multitenant Container Database (CDB), which contains two or more pluggable databases (PDBs). When a container database contains only one PDB, it is called a Single Tenant Container Database. You can also have your database on Oracle 12c using the same format as before 12c; it is then called a non-CDB database and does not allow the use of PDBs.

Pluggable database

We are now able to have multiple databases sharing a single instance and Oracle binaries. Each of the databases is configurable to a degree and allows some parameters to be set specifically for itself (even though they all share the same initialization parameter file), and, better still, each database is completely isolated from the others without knowing that they exist.

A CDB is a single physical database that contains a root container with the main Oracle data dictionary and at least one PDB with specific application data. A PDB is a portable container with its own data dictionary, including metadata and internal links to the system-supplied objects in the root container, and it appears to an Oracle Net client as a traditional Oracle database. The CDB also contains a PDB called SEED, which is used as a template when an empty PDB needs to be created.

When creating a database on Oracle 12c, you can now create a CDB with one or more PDBs, and, even better, you can easily clone a PDB, or unplug it and plug it into a different server with a preinstalled CDB if your target server is running out of resources such as CPU or memory.

Many years ago, the introduction of external storage gave us the possibility to store data on external devices and the flexibility to plug and unplug them into any system, independent of its OS. For example, you can connect an external device to a system running Windows XP and read your data without any problems; later you can unplug it, connect it to a laptop running Windows 7, and still be able to read your data. Now, with the introduction of Oracle pluggable databases, we can do something similar with Oracle when upgrading a PDB, making the process simple and easy. All you need to do to upgrade a PDB, for example, is:

1. Unplug your PDB from a CDB running 12.1.0.1.
2. Copy the PDB to the destination location with a CDB that is using a later version, such as 12.2.0.1.
3. Plug the PDB into that CDB, and your PDB is now upgraded to 12.2.0.1.

This new concept is a great solution for database consolidation and is very useful for multitenant SaaS (Software as a Service) providers, improving resource utilization, manageability, integration, and service management.
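As a rough SQL sketch of the unplug/plug sequence just described (this listing is not part of the original text, and the file paths are placeholders), the operations on the source and target CDBs look approximately like this; plugging into a CDB running a later version may also require the usual upgrade scripts to be run against the PDB afterwards:

-- On the source CDB
ALTER PLUGGABLE DATABASE pdb1 CLOSE IMMEDIATE;
ALTER PLUGGABLE DATABASE pdb1 UNPLUG INTO '/u01/app/oracle/pdb1.xml';

-- On the target CDB, after copying the datafiles and the XML manifest
CREATE PLUGGABLE DATABASE pdb1 USING '/u01/app/oracle/pdb1.xml' NOCOPY;
ALTER PLUGGABLE DATABASE pdb1 OPEN;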
Some key points about pluggable databases are:

- You can have many PDBs inside a single container (a CDB can contain a maximum of 253 PDBs).
- A PDB is fully backwards compatible with an ordinary pre-12.1 database from an application's perspective, meaning that an application built, for example, to run on Oracle 11.1 will not need to be changed to run on Oracle 12c.
- A system administrator can connect to a CDB as a whole and see a single system image.
- If you are not ready to make use of this new concept, you can still create a database on Oracle 12c as before, called a non-CDB (non-Container Database).
- Each instance in RAC opens the CDB as a whole. A foreground session sees only the single PDB it is connected to, and sees it just as a non-CDB.
- The Resource Manager is extended with new between-PDB capabilities.
- Fully integrated with Oracle Enterprise Manager 12c and SQL Developer.
- Fast provisioning of new databases (empty, or as a copy/clone of an existing PDB).
- On-clone triggers can be used to scrub or mask data during a clone process.
- Fast unplug and plug between CDBs.
- Fast patch or upgrade by unplugging a PDB and plugging it into a different CDB that is already patched or running a later database version.
- Separation of duties between the DBA and application administrators.
- Communication between PDBs is allowed via intra-CDB dblinks.
- Every PDB has a default service with its name registered in the listener.
- An unplugged PDB carries its lineage, Opatch information, encryption key information, and much more.
- All PDBs in a CDB should use the same character set.
- All PDBs share the same control files, SPFILE, redo log files, flashback log files, and undo.
- Flashback PDB is not available in 12.1; it is expected to be available with 12.2.
- Allows multitenancy of Oracle Databases, which is very useful for centralization, especially when using Exadata.
- The Multitenant Container Database is only available for Oracle Enterprise Edition as a payable option; all other editions of the Oracle database can only deploy non-CDB or Single Tenant container databases.

RMAN new features and enhancements

Now we can continue and take a closer look at some of the new features and enhancements introduced in this database version for RMAN.

Container and pluggable database backup and restore

As we saw earlier, the introduction of Oracle 12c and the new pluggable database concept made it possible to easily centralize multiple databases while maintaining the individuality of each one using a single instance. The introduction of this new concept also forced Oracle to add enhancements to the existing BACKUP, RESTORE, and RECOVER commands, so that we can make an efficient backup or restore of the complete CDB, including all PDBs, or of just one or more PDBs; if you want to be even more specific, you can also back up or restore one or more tablespaces from a PDB.
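The RMAN examples that follow assume a session connected to the root of the CDB. As an illustrative sketch (the connection string below is a placeholder, not something taken from the original text), such a session can be started and verified roughly like this:

$ rman target sys@cdb1
RMAN> REPORT SCHEMA;     # when connected to the root, this lists the datafiles of the root and of every PDB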
Some examples of how to use the RMAN commands when performing a backup on Oracle 12c are:

RMAN> BACKUP DATABASE;                          (backs up the CDB plus all PDBs)
RMAN> BACKUP DATABASE root;                     (backs up only the CDB root)
RMAN> BACKUP PLUGGABLE DATABASE pdb1, pdb2;     (backs up all specified PDBs)
RMAN> BACKUP TABLESPACE pdb1:example;           (backs up a specific tablespace in a PDB)

Some examples when performing RESTORE operations are:

RMAN> RESTORE DATABASE;                         (restores the entire CDB, including all PDBs)
RMAN> RESTORE DATABASE root;                    (restores only the root container)
RMAN> RESTORE PLUGGABLE DATABASE pdb1;          (restores a specific PDB)
RMAN> RESTORE TABLESPACE pdb1:example;          (restores a tablespace in a PDB)

Finally, some examples of RECOVER operations are:

RMAN> RECOVER DATABASE;                         (root plus all PDBs)

RMAN> RUN {
  SET UNTIL SCN 1428;
  RESTORE DATABASE;
  RECOVER DATABASE;
  ALTER DATABASE OPEN RESETLOGS;
}

RMAN> RUN {
  RESTORE PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
  RECOVER PLUGGABLE DATABASE pdb1 TO RESTORE POINT one;
  ALTER PLUGGABLE DATABASE pdb1 OPEN RESETLOGS;
}

Enterprise Manager Database Express

The Oracle Enterprise Manager Database Console (or Database Control) that many of us used to manage an entire database is now deprecated and replaced by the new Oracle Enterprise Manager Database Express. This new tool uses Flash technology and allows the DBA to easily manage the configuration, storage, security, and performance of a database. Note that RMAN, Data Pump, and Oracle Enterprise Manager Cloud Control are now the only tools able to perform backup and recovery operations in a pluggable database environment; in other words, you cannot use Enterprise Manager Database Express for database backup and recovery operations.

Backup privileges

Oracle Database 12c supports the separation of DBA duties by introducing task-specific, least-privileged administrative privileges for backups that do not require the SYSDBA privilege. The new system privilege introduced with this release is SYSBACKUP. Avoid the use of the SYSDBA privilege for backups unless it is strictly necessary. When connecting to the database using the AS SYSDBA system privilege, you are able to see any object structure and all the data within the object, whereas if you connect using the new AS SYSBACKUP system privilege, you will still be able to see the structure of an object but not the object data. If you try to see any data using the SYSBACKUP privilege, the ORA-01031: insufficient privileges message will be raised.

Tighter security policies require a separation of duties. The new SYSBACKUP privilege facilitates the implementation of this separation, allowing backup and recovery operations to be performed without implicit access to the data; if access to the data is required for one specific user, it will need to be granted explicitly to that user.
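As an illustrative sketch (the user name and service name are placeholders, not from the original text), connecting with the new privilege from SQL*Plus or from the RMAN command line looks roughly like this; note the extra quoting RMAN needs around a connect string that carries an AS SYSBACKUP clause:

$ sqlplus bkpadm@pdb1 AS SYSBACKUP
$ rman target '"bkpadm@pdb1 AS SYSBACKUP"'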
RMAN has introduced some changes when connecting to a database, such as:

- TARGET: The user now requires the SYSBACKUP administrative privilege to be able to connect to the TARGET database.
- CATALOG: Whereas in earlier versions a user was required to have the RECOVERY_CATALOG_OWNER role assigned to connect to the RMAN catalog, it now needs to have the SYSBACKUP privilege assigned to connect to the catalog.
- AUXILIARY: The SYSBACKUP administrative privilege is required to connect to the AUXILIARY database.

Some important points about the SYSBACKUP administrative privilege are:

- It includes permissions for backup and recovery operations.
- It does not include data access privileges, such as SELECT ANY TABLE, that the SYSDBA privilege has.
- It can be granted to the SYSBACKUP user that is created during the database installation process.
- It is the default privilege when an RMAN connection string does not contain the AS SYSBACKUP clause:

$ RMAN TARGET /

Before connecting as the SYSBACKUP user created during the database creation process, you will need to unlock the account and grant the SYSBACKUP privilege to the user. When you use the GRANT command to give the SYSBACKUP privilege to a user, the username and privilege information are automatically added to the database password file. The v$pwfile_users view contains all information regarding users within the database password file and indicates whether a user has been granted any privileged system privilege. Let's take a closer look at this view:

SQL> DESC v$pwfile_users
Name                          Null?    Type
----------------------------- -------- -----------------
USERNAME                               VARCHAR2(30)
SYSDBA                                 VARCHAR2(5)
SYSOPER                                VARCHAR2(5)
SYSASM                                 VARCHAR2(5)
SYSBACKUP                              VARCHAR2(5)
SYSDG                                  VARCHAR2(5)
SYSKM                                  VARCHAR2(5)
CON_ID                                 NUMBER

As you can see, this view now contains some new columns, such as:

- SYSBACKUP: Indicates whether the user is able to connect using the SYSBACKUP privilege.
- SYSDG: Indicates whether the user is able to connect using the SYSDG (new for Data Guard) privilege.
- SYSKM: Indicates whether the user is able to connect using the SYSKM (new for Advanced Security) privilege.
- CON_ID: The ID of the current container. A value of 0 indicates that the row relates to the entire CDB or to an entire traditional database (non-CDB); a value of 1 means the user has access only to root; any other value identifies a specific container ID.

To help you clearly understand the use of the SYSBACKUP privilege, let's run a few examples to make it completely clear. Let's connect to our newly created database as SYSDBA and take a closer look at the SYSBACKUP account:

$ sqlplus / as sysdba
SQL> SET PAGES 999
SQL> SET LINES 99
SQL> COL USERNAME FORMAT A21
SQL> COL ACCOUNT_STATUS FORMAT A20
SQL> COL LAST_LOGIN FORMAT A41
SQL> SELECT username, account_status, last_login
  2  FROM dba_users
  3  WHERE username = 'SYSBACKUP';

USERNAME     ACCOUNT_STATUS       LAST_LOGIN
------------ -------------------- -----------------------
SYSBACKUP    EXPIRED & LOCKED

As you can see, the SYSBACKUP account created during database creation is currently EXPIRED & LOCKED. You will need to unlock this account and grant the SYSBACKUP privilege to it if you want to use this user for any backup and recovery purposes.

For this demo I will use the original SYSBACKUP account, but in a production environment never use the SYSBACKUP account; instead, grant the SYSBACKUP privilege to the user(s) that will be responsible for the backup and recovery operations.
SQL> ALTER USER sysbackup IDENTIFIED BY "demo" ACCOUNT UNLOCK;
User altered.

SQL> GRANT sysbackup TO sysbackup;
Grant succeeded.

SQL> SELECT username, account_status
  2  FROM dba_users
  3  WHERE account_status NOT LIKE '%LOCKED';

USERNAME              ACCOUNT_STATUS
--------------------- --------------------
SYS                   OPEN
SYSTEM                OPEN
SYSBACKUP             OPEN

We can also easily identify which system privileges and roles are assigned to SYSBACKUP by executing the following SQL statements:

SQL> COL grantee FORMAT A20
SQL> SELECT *
  2  FROM dba_sys_privs
  3  WHERE grantee = 'SYSBACKUP';

GRANTEE       PRIVILEGE                           ADM COM
------------- ----------------------------------- --- ---
SYSBACKUP     ALTER SYSTEM                        NO  YES
SYSBACKUP     AUDIT ANY                           NO  YES
SYSBACKUP     SELECT ANY TRANSACTION              NO  YES
SYSBACKUP     SELECT ANY DICTIONARY               NO  YES
SYSBACKUP     RESUMABLE                           NO  YES
SYSBACKUP     CREATE ANY DIRECTORY                NO  YES
SYSBACKUP     UNLIMITED TABLESPACE                NO  YES
SYSBACKUP     ALTER TABLESPACE                    NO  YES
SYSBACKUP     ALTER SESSION                       NO  YES
SYSBACKUP     ALTER DATABASE                      NO  YES
SYSBACKUP     CREATE ANY TABLE                    NO  YES
SYSBACKUP     DROP TABLESPACE                     NO  YES
SYSBACKUP     CREATE ANY CLUSTER                  NO  YES

13 rows selected.

SQL> COL granted_role FORMAT A30
SQL> SELECT *
  2  FROM dba_role_privs
  3  WHERE grantee = 'SYSBACKUP';

GRANTEE        GRANTED_ROLE                   ADM DEF COM
-------------- ------------------------------ --- --- ---
SYSBACKUP      SELECT_CATALOG_ROLE            NO  YES YES

Here, the ADM (ADMIN_OPTION) column indicates whether or not the grant was made with the ADMIN OPTION, the DEF (DEFAULT_ROLE) column indicates whether or not the role is designated as a default role for the user, and the COM (COMMON) column indicates whether the grant is common to all the containers and pluggable databases available.

SQL and DESCRIBE

Starting with Oracle 12.1, you are able to execute SQL commands and PL/SQL procedures from the RMAN command line without needing the SQL prefix or quotes for most SQL commands. You can now run some simple SQL commands in RMAN, such as:

RMAN> SELECT TO_CHAR(sysdate,'dd/mm/yy - hh24:mi:ss')
2> FROM dual;

TO_CHAR(SYSDATE,'DD)
-------------------
17/09/12 - 02:58:40

RMAN> DESC v$datafile
Name                          Null?    Type
----------------------------- -------- -------------------
FILE#                                  NUMBER
CREATION_CHANGE#                       NUMBER
CREATION_TIME                          DATE
TS#                                    NUMBER
RFILE#                                 NUMBER
STATUS                                 VARCHAR2(7)
ENABLED                                VARCHAR2(10)
CHECKPOINT_CHANGE#                     NUMBER
CHECKPOINT_TIME                        DATE
UNRECOVERABLE_CHANGE#                  NUMBER
UNRECOVERABLE_TIME                     DATE
LAST_CHANGE#                           NUMBER
LAST_TIME                              DATE
OFFLINE_CHANGE#                        NUMBER
ONLINE_CHANGE#                         NUMBER
ONLINE_TIME                            DATE
BYTES                                  NUMBER
BLOCKS                                 NUMBER
CREATE_BYTES                           NUMBER
BLOCK_SIZE                             NUMBER
NAME                                   VARCHAR2(513)
PLUGGED_IN                             NUMBER
BLOCK1_OFFSET                          NUMBER
AUX_NAME                               VARCHAR2(513)
FIRST_NONLOGGED_SCN                    NUMBER
FIRST_NONLOGGED_TIME                   DATE
FOREIGN_DBID                           NUMBER
FOREIGN_CREATION_CHANGE#               NUMBER
FOREIGN_CREATION_TIME                  DATE
PLUGGED_READONLY                       VARCHAR2(3)
PLUGIN_CHANGE#                         NUMBER
PLUGIN_RESETLOGS_CHANGE#               NUMBER
PLUGIN_RESETLOGS_TIME                  DATE
CON_ID                                 NUMBER

RMAN> ALTER TABLESPACE users
2> ADD DATAFILE '/u01/app/oracle/oradata/cdb1/pdb1/user02.dbf' size 50M;
Statement processed

Remember that the SYSBACKUP privilege does not grant access to user tables or views, but the SYSDBA privilege does.

Multi-section backups for incremental backups

Oracle Database 11g introduced multi-section backups to allow us to back up and restore very large files using backup sets (remember that Oracle datafiles can be up to 128 TB in size). Now, with Oracle Database 12c, we are also able to make use of image copies when creating multi-section backups, as a complement to the previous backup set functionality.
This helps us reduce image copy creation time for backups, transporting tablespaces, cloning, and performing a TSPITR (tablespace point-in-time recovery); it also improves backups when using Exadata. The main restrictions on using this enhancement are:

- The COMPATIBLE initialization parameter needs to be set to 12.0 or higher to make use of the new image copy multi-section backup feature.
- It is only available for datafiles and cannot be used to back up control files or password files.
- Avoid using a high degree of parallelism when a file resides on a small number of disks, to prevent the processes from competing with each other when accessing the same device.

Another new feature introduced with multi-section backups is the ability to create multi-section incremental backups. This allows RMAN to back up only the data that has changed since the last backup, consequently enhancing the performance of multi-section backups, because the sections are processed independently, either serially or in parallel.

Network-based recovery

Restoring and recovering files over the network is supported starting with Oracle Database 12c. We can now recover a standby database and synchronize it with its primary database over the network without the need to ship the archived log files. When the RECOVER command is executed, an incremental backup is created on the primary database, transferred over the network to the physical standby database, and applied to the standby database to synchronize it with the primary database. RMAN uses the SCN from the standby datafile header and creates the incremental backup starting from this SCN on the primary database; in other words, it brings over only the information necessary for the synchronization process. If block change tracking is enabled for the primary database, it will be used while creating the incremental backup, making it faster.

A network-based recovery can also be used to replace any missing datafiles, control files, SPFILE, or tablespaces on the primary database using the corresponding entity from the physical standby in the recovery operation. You can also use multi-section backup sets, encryption, or even compression within a network-based recovery.

Active Duplicate

The Active Duplicate feature generates an online backup on the TARGET database and directly transmits it via an inter-instance network connection to the AUXILIARY database for duplication (it is not written to disk on the source server). Consequently, this reduces the impact on the TARGET database by offloading the data transfer operation to the AUXILIARY database, also reducing the duplication time.

This very useful feature has now received some important enhancements. When this feature was initially introduced in Oracle 11g, it only allowed us to use a push process based on image copies. Now it also allows us to use a newly introduced pull process from the AUXILIARY database that is based on backup sets (the pull process is the new default and automatically copies across all datafiles, control files, the SPFILE, and archived log files). It then restores all files and uses a memory script to complete the recovery operation and open the AUXILIARY database. RMAN dynamically determines, based on your DUPLICATE clauses, which process (push or pull) will be used. It is quite possible that Oracle will end up deprecating the push process in future releases of the database.
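As a rough sketch of a basic active duplication (this listing is not from the original text, and the connect strings and database name are placeholders), the command is issued from an RMAN session connected to both the TARGET and the AUXILIARY instances:

RMAN> CONNECT TARGET sys@prod
RMAN> CONNECT AUXILIARY sys@dupdb
RMAN> DUPLICATE TARGET DATABASE TO dupdb
2>    FROM ACTIVE DATABASE;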
You can now choose the compression, section size, and encryption to be used during the active duplication process. For example, if you specify the SET ENCRYPTION option before the DUPLICATE command, all the backups sent from the target to the auxiliary database will be encrypted. For an effective use of parallelism, allocate more AUXILIARY channels instead of TARGET channels, as in the earlier releases.

Finally, another important enhancement is the possibility to finish the duplication process with the AUXILIARY database in a not-open state (the default is to open the AUXILIARY database after the duplication is completed). This option is very useful when you are required to:

- Modify the block change tracking
- Configure fast incremental backups or flashback database settings
- Move the location of the database, for example, to ASM
- Upgrade the AUXILIARY database (because the database must not be opened with RESETLOGS prior to applying the upgrade scripts)
- Handle a case where you know that the attempt to open the database would produce errors

To make it clearer, let's take a closer look at the operations RMAN performs when a DUPLICATE command is used:

1. Creates an SPFILE for the AUXILIARY instance.
2. Mounts the backup control file.
3. Restores the TARGET datafiles on the AUXILIARY database.
4. Performs incomplete recovery using all the available incremental backups and archived redo log files.
5. Shuts down and restarts the AUXILIARY instance in NOMOUNT mode.
6. Creates a new control file, and creates and stores the new database ID in the datafiles (this does not happen if the FOR STANDBY clause is in use).
7. Mounts and opens the duplicate database using the RESETLOGS option, and creates the online redo log files by default. If the NOOPEN option is used, the duplicated database is not opened with RESETLOGS and remains in the MOUNT state.

Here are some examples of how to use the DUPLICATE command with PDBs:

RMAN> DUPLICATE TARGET DATABASE TO <CDB1>;
RMAN> DUPLICATE TARGET DATABASE TO <CDB1> PLUGGABLE DATABASE <PDB1>, <PDB2>, <PDB3>;

Support for third-party snapshots

In the past, when using a third-party snapshot technology to make a backup or clone of a database, you were forced to place the database in backup mode (BEGIN BACKUP) before executing the storage snapshot. This requirement is no longer necessary if the following conditions are met:

- The database is crash-consistent at the point of the snapshot
- Write ordering is preserved for each file within the snapshot
- The snapshot stores the time at which the snapshot is completed

If a storage vendor cannot guarantee compliance with these conditions, then you must place your database in backup mode before starting the snapshot. The RECOVER command now has a newly introduced option called SNAPSHOT TIME that allows RMAN to recover a snapshot that was taken without the database being in backup mode to a consistent point-in-time. Some examples of how to use this new option are:

RMAN> RECOVER DATABASE UNTIL TIME '10/12/2012 10:30:00'
      SNAPSHOT TIME '10/12/2012 10:00:00';

RMAN> RECOVER DATABASE UNTIL CANCEL
      SNAPSHOT TIME '10/12/2012 10:00:00';

Only trust your backups after you ensure that they are usable for recovery. In other words, always test your backup methodology first, ensuring that it can be used in the future in case of a disaster.
Cross-platform data transport

Starting with Oracle 12c, transporting data across platforms can be done using backup sets, and you can also create cross-platform inconsistent tablespace backups (taken while the tablespace is not in read-only mode) using image copies and backup sets. When using backup sets, you can make use of the compression and multi-section options, reducing downtime for tablespace and database platform migrations. RMAN does not catalog backup sets created for cross-platform transport in the control file, and it always takes into consideration the endian format of the platforms and the database open mode.

Before creating a backup set that will be used for a cross-platform data transport, the following prerequisites should be met:

- The COMPATIBLE parameter in the SPFILE should be 12.0 or greater.
- The source database must be open in read-only mode when transporting an entire database, because the SYS and SYSAUX tablespaces participate in the transport process.
- If using Data Pump, the database must be open in read-write mode.

You can easily check the current compatible value and open_mode of your database by running the following SQL commands:

SQL> SHOW PARAMETER compatible

NAME                   TYPE        VALUE
---------------------- ----------- ----------------------
compatible             string      12.0.0.0.0

SQL> SELECT open_mode FROM v$database;

OPEN_MODE
--------------------
READ WRITE

When making use of the FOR TRANSPORT or the TO PLATFORM clauses in the BACKUP command, you cannot use the following clauses:

- CUMULATIVE
- forRecoveryOfSpec
- INCREMENTAL LEVEL n
- keepOption
- notBackedUpSpec
- PROXY
- SECTION SIZE
- TAG
- VALIDATE

Table recovery

In previous versions of Oracle Database, the process of recovering a table to a specific point-in-time was never easy. Oracle has now solved this major issue by introducing the possibility of doing a point-in-time recovery of a table, a group of tables, or even table partitions, without affecting the remaining database objects, using RMAN. This makes the process easier and faster than ever before. Remember that Oracle has previously introduced features such as database point-in-time recovery (DBPITR), tablespace point-in-time recovery (TSPITR), and Flashback Database; this is an evolution of the same technology and principles.

The recovery of tables and table partitions is useful in the following situations:

- To recover a very small set of tables to a particular point-in-time
- To recover a tablespace that is not self-contained to a particular point-in-time (remember that TSPITR can only be used if the tablespace is self-contained)
- To recover tables that are corrupted or were dropped with the PURGE option, so that the FLASHBACK DROP functionality cannot be used
- When logging for a Flashback Table is enabled but the flashback target time or SCN is beyond the available undo
- To recover data that was lost after a data definition language (DDL) operation that changed the structure of a table

To recover tables and table partitions from an RMAN backup, the TARGET database should meet the following prerequisites:

- It is in READ/WRITE mode
- It is in ARCHIVELOG mode
- The COMPATIBLE parameter is set to 12.0 or higher

You cannot recover tables or table partitions from the SYS, SYSTEM, and SYSAUX schemas, or from a standby database.

Now let's take a closer look at the steps to recover a table or table partitions using RMAN:

1. First check that all the prerequisites for a table recovery are met.
2. Start an RMAN session with the CONNECT TARGET command.
3. Use the RECOVER TABLE command with all the required clauses. RMAN determines which backup contains the data that needs to be recovered, based on the point-in-time specified.
4. RMAN creates an AUXILIARY instance; you can also specify the location of the AUXILIARY instance files using the AUXILIARY DESTINATION or SET NEWNAME clause.
5. RMAN recovers the specified objects into the AUXILIARY instance.
6. RMAN creates a Data Pump export dump file that contains the recovered objects.
7. RMAN imports the recovered objects from the previously created dump file into the TARGET database. If you want to import the objects into the TARGET database manually, you can use the NOTABLEIMPORT clause of the RECOVER command to achieve this.
8. RMAN optionally offers the possibility to rename the recovered objects in the TARGET database using the REMAP TABLE clause, or to import the recovered objects into a different tablespace using the REMAP TABLESPACE clause.

An example of how to use the new RECOVER TABLE command is:

RMAN> RECOVER TABLE SCOTT.test
      UNTIL SEQUENCE 5481 THREAD 2
      AUXILIARY DESTINATION '/tmp/recover'
      REMAP TABLE SCOTT.test:my_test;

Data visualization

Packt
27 Oct 2014
8 min read
Data visualization is one of the most important tasks in data science. Through effective visualization we can easily uncover underlying patterns among variables without doing any sophisticated statistical analysis. In this cookbook we have focused on graphical analysis using R in a very simple way, with each example kept independent. We have covered default R functionality along with more advanced visualization techniques such as lattice, ggplot2, and three-dimensional plots. Readers will not only learn the code to produce the graphs but also why certain code has been written, through specific examples.

R Graphs Cookbook Second Edition, written by Jaynal Abedin and Hrishi V. Mittal, is a book in which the user will learn how to produce various graphs using R, how to customize them, and finally how to make them ready for publication. This practical recipe book starts with a very brief description of the R graphics system and then gradually works through basic to advanced plots with examples. Besides the default R graphics, this recipe book introduces advanced graphics systems such as lattice and ggplot2, the grammar of graphics. We have also provided examples of how to inspect large datasets using advanced visualization such as tableplot and three-dimensional visualizations. We also cover the following topics:

- How to create various types of bar charts using default R functions, lattice, and ggplot2
- How to produce density plots along with histograms using lattice and ggplot2, and customize them for publication
- How to produce graphs of frequency-tabulated data
- How to inspect large datasets by simultaneously visualizing numeric and categorical variables in a single plot
- How to annotate graphs using ggplot2

(For more resources related to this topic, see here.)

This recipe book is targeted at readers who have already been exposed to R programming and want to learn effective graphics with the power of R and its various libraries. This hands-on guide starts with a very short introduction to the R graphics system and then gets straight to the point – actually creating graphs, instead of just theoretical learning. Each recipe is specifically tailored to fulfill the reader's appetite for visually representing the data in the best way possible. Now, we will present a few examples so that you can have an idea about the content of this recipe book.

The ggplot2 R package is based on The Grammar of Graphics (by Leland Wilkinson, Springer). Using this package, we can produce a variety of traditional graphics, and the user can produce customized graphs as well. The beauty of this package is in its layered graphics facilities; through the use of layered graphics utilities, we can produce almost any kind of data visualization. Recently, ggplot2 has been the most searched keyword in the R community, including on the most popular R blog (www.r-bloggers.com). The comprehensive theme system allows the user to produce publication-quality graphs with a variety of themes of choice. If we want to explain this package in a single sentence, then we can say that if whatever we can think about in data visualization can be structured in a data frame, the visualization is a matter of a few seconds. In the specific chapter on ggplot2, we will see different examples and use themes to produce publication-quality graphs. However, in this introductory chapter, we will show you one of the important features of the ggplot2 package that produces various types of graphs.
The main function is ggplot(), but with the help of different geom functions we can easily produce different types of graphs, such as the following:

- geom_point(): This will create a scatter plot
- geom_line(): This will create a line chart
- geom_bar(): This will create a bar chart
- geom_boxplot(): This will create a box plot
- geom_text(): This will write certain text inside the plot area

Now, we will see a simple example of the use of different geom functions with the default R mtcars dataset:

# loading ggplot2 library
library(ggplot2)

# creating a basic ggplot object
p <- ggplot(data=mtcars)

# creating a scatter plot of the mpg and disp variables
p1 <- p + geom_point(aes(x=disp, y=mpg))

# creating a line chart from the same ggplot object but a different geom function
p2 <- p + geom_line(aes(x=disp, y=mpg))

# creating a bar chart of the mpg variable
p3 <- p + geom_bar(aes(x=mpg))

# creating a boxplot of mpg over gear
p4 <- p + geom_boxplot(aes(x=factor(gear), y=mpg))

# writing certain text into the scatter plot
p5 <- p1 + geom_text(x=200, y=25, label="Scatter plot")

The visualization of the preceding five plots will look like the following figure.

Visualizing an empirical Cumulative Distribution function

The empirical Cumulative Distribution function (CDF) is the non-parametric maximum-likelihood estimate of the CDF. In this recipe, we will see how the empirical CDF can be produced.

Getting ready

To produce this plot, we need to use the latticeExtra library. We will use a simulated dataset, as shown in the following code:

# Set a seed value to make the data reproducible
set.seed(12345)
qqdata <- data.frame(disA=rnorm(n=100,mean=20,sd=3),
                     disB=rnorm(n=100,mean=25,sd=4),
                     disC=rnorm(n=100,mean=15,sd=1.5),
                     age=sample((c(1,2,3,4)),size=100,replace=T),
                     sex=sample(c("Male","Female"),size=100,replace=T),
                     econ_status=sample(c("Poor","Middle","Rich"),
                                        size=100,replace=T))

How to do it…

To plot an empirical CDF, we first need to call the latticeExtra library (note that this library has a dependency on RColorBrewer). Then, to plot the empirical CDF, we can use the following simple code:

library(latticeExtra)
ecdfplot(~disA|sex, data=qqdata)

Graph annotation with ggplot

To produce publication-quality data visualization, we often need to annotate the graph with various texts, symbols, or even shapes. In this recipe, we will see how we can easily annotate an existing graph.

Getting ready

In this recipe, we will use the disA and disD variables from ggplotdata. Let's call ggplotdata for this recipe. We also need to call the grid and gridExtra libraries for this recipe.

How to do it...

In this recipe, we will execute the following annotations on an existing scatter plot.
So, the whole procedure will be as follows:

1. Create a scatter plot.
2. Add customized text within the plot.
3. Highlight a certain region to indicate extreme values.
4. Draw a line segment with an arrow within the scatter plot to indicate a single extreme observation.

Now, we will implement each of the steps one by one:

library(grid)
library(gridExtra)

# creating the scatter plot and printing it
annotation_obj <- ggplot(data=ggplotdata, aes(x=disA, y=disD)) + geom_point()
annotation_obj

# Adding custom text at the (18,29) position
annotation_obj1 <- annotation_obj +
  annotate(geom="text", x=18, y=29, label="Extreme value", size=3)
annotation_obj1

# Highlighting a certain region with a box
annotation_obj2 <- annotation_obj1 +
  annotate("rect", xmin=24, xmax=27, ymin=17, ymax=22, alpha=.2)
annotation_obj2

# Drawing a line segment with an arrow
annotation_obj3 <- annotation_obj2 +
  annotate("segment", x=16, xend=17.5, y=25, yend=27.5, colour="red",
           arrow=arrow(length=unit(0.5, "cm")), size=2)
annotation_obj3

The preceding four steps are displayed in a single combined graph.

How it works...

The annotate() function takes a geom as input, such as "text", "rect", or "segment", and then takes further inputs specifying where that geom should be drawn or placed. In this particular recipe, we used three geom instances: text to write customized text within the plot, rect to highlight a certain region in the plot, and segment to draw an arrow. The alpha argument represents the transparency of the region, and the size argument represents the size of the text and the line width of the line segment.

Summary

This article gives a sample of the kind of recipes that are included in the book, and of how each recipe is structured.

Resources for Article:

Further resources on this subject:

- Using R for Statistics, Research, and Graphics [Article]
- First steps with R [Article]
- Aspects of Data Manipulation in R [Article]

Schemas and Models

Packt
27 Aug 2013
12 min read
(For more resources related to this topic, see here.)

So what is a schema? At its simplest, a schema is a way to describe the structure of data. Typically this involves giving each piece of data a label and stating what type of data it is, for example, a number, date, string, and so on.

In the following example, we are creating a new Mongoose schema called userSchema. We are stating that a database document using this schema will have three pieces of data, which are as follows:

- name: This data will contain a string
- email: This will also contain a string value
- createdOn: This data will contain a date

The following is the schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Field sizes

Note that, unlike some other systems, there is no need to set the field size. This can be useful if you need to change the amount of data stored in a particular object. For example, your system might impose a 16-character limit on usernames, so you set the size of the field to 16 characters. Later, you realize that you want to encrypt the usernames, but this will double the length of the data stored. If your database schema uses fixed field sizes, you will need to refactor it, which can take a long time on a large database. With Mongoose, you can just start encrypting that data object without worrying about it. If you're storing large documents, you should bear in mind that MongoDB imposes a maximum document size of 16 MB. There are ways around even this limit, using the MongoDB GridFS API.

Data types allowed in schemas

There are eight types of data that can—by default—be set in a Mongoose schema. These are also referred to as SchemaTypes; they are:

- String
- Number
- Date
- Boolean
- Buffer
- ObjectId
- Mixed
- Array

The first four SchemaTypes are self-explanatory, but let's take a quick look at them all.

String

This SchemaType stores a string value, UTF-8 encoded.

Number

This SchemaType stores a number value, with restrictions. Mongoose does not natively support the long and double datatypes, for example, although MongoDB does. However, Mongoose can be extended using plugins to support these other types.

Date

This SchemaType holds a date and time object, typically returned from MongoDB as an ISODate object, for example, ISODate("2013-04-03T12:56:26.009Z").

Boolean

This SchemaType has only two values: true or false.

Buffer

This SchemaType is primarily used for storing binary information, for example, images stored in MongoDB.

ObjectId

This SchemaType is used to assign a unique identifier to a key other than _id. Rather than just specifying the type as ObjectId, you need to specify the fully qualified version, Schema.Types.ObjectId. For example:

projectSchema.add({owner: mongoose.Schema.Types.ObjectId});

Mixed

A mixed data object can contain any type of data. It can be declared either by setting an empty object, or by using the fully qualified Schema.Types.Mixed. The following two commands will do the same thing:

var djSchema = new mongoose.Schema({mixedUp: {}});
var djSchema = new mongoose.Schema({mixedUp: Schema.Types.Mixed});

While this sounds like it might be great, there is a big caveat. Changes to data of the Mixed type cannot be automatically detected by Mongoose, so it doesn't know that it needs to save them.

Tracking changes to Mixed type

As Mongoose can't automatically see changes made to Mixed type data, you have to manually declare when the data has changed.
Fortunately, Mongoose exposes a method called markModified to do just this, passing it the path of the data object that has changed:

dj.mixedUp = { valueone: "a new value" };
dj.markModified('mixedUp');
dj.save();

Array

The array datatype can be used in two ways. First, as a simple array of values of the same data type, as shown in the following code snippet:

var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [String]
});

Second, the array datatype can be used to store a collection of subdocuments using nested schemas. Here's an example of how this can work:

var emailSchema = new mongoose.Schema({
  email: String,
  verified: Boolean
});
var userSchema = new mongoose.Schema({
  name: String,
  emailAddresses: [emailSchema]
});

Warning – array defined as mixed type

A word of caution. If you declare an empty array it will be treated as the Mixed type, meaning that Mongoose will not be able to automatically detect any changes made to the data. So avoid these two types of array declaration, unless you intentionally want a mixed type:

var emailSchema = new mongoose.Schema({addresses: []});
var emailSchema = new mongoose.Schema({addresses: Array});

Custom SchemaTypes

If your data requires a different datatype which is not covered earlier in this article, Mongoose offers the option of extending it with custom SchemaTypes. The extension method is managed using Mongoose plugins. Some examples of SchemaType extensions that have already been created are: long, double, RegEx, and even email.

Where to write the schemas

As your schemas sit on top of Mongoose, the only absolute is that they need to be defined after Mongoose is required. You don't need an active or open connection to define your schemas. That being said, it is advisable to make your connection early on, so that it is available as soon as possible, bearing in mind that a remote database or replica set may take longer to connect than your localhost development server. While no action can be taken on the database through the schemas and models until the connection is open, Mongoose can buffer requests made from when the connection is defined. Mongoose models also rely on the connection being defined, so there's another reason to get the connection set up early in the code and then define the schemas and models.

Writing a schema

Let's write the schema for a User in our MongoosePM application. The first thing we have to do is declare a variable to hold the schema. I recommend taking the object name (for example, user or project) and adding Schema to the end of it. This makes following the code later on super easy. The second thing we need to do is create a new Mongoose schema object to assign to this variable. The skeleton of this is as follows:

var userSchema = new mongoose.Schema({ });

We can add in the basic values of name, email, and createdOn that we looked at earlier, giving us our first user schema definition:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date
});

Modifying an existing schema

Suppose we run the application with this for a while, and then decide that we want to record the last time each user logged on, and the last time their record was modified. No problem! We don't have to refactor the database or take it offline while we upgrade the schema; we simply add a couple of entries to the Mongoose schema. If a key requested in the schema doesn't exist, neither Mongoose nor MongoDB will throw errors; Mongoose will just return null values.
When saving the MongoDB documents, the new keys and values will be added and stored as required. If the value is null, then the key is not added. So let's add modifiedOn and lastLogin to our userSchema:

var userSchema = new mongoose.Schema({
  name: String,
  email: String,
  createdOn: Date,
  modifiedOn: Date,
  lastLogin: Date
});

Setting a default value

Mongoose allows us to set a default value for a data key when the document is first created. Looking at our schema created earlier, a possible candidate for this is createdOn. When a user first signs up, we want the date and time to be set. We could do this by adding a timestamp to the controller function when we create a user, or—to make a point—we can modify the schema to set a default value. To do this, we need to change the information we are sending about the createdOn data object. What we have currently is:

createdOn: Date

This is short for:

createdOn: { type: Date }

We can add another entry to this object to set a default value here, using the JavaScript Date object:

createdOn: { type: Date, default: Date.now }

Now every time a new user is created, its createdOn value will be set to the current date and time. Note that in JavaScript default is a reserved word. While the language allows reserved words to be used as keys, some IDEs and linters regard it as an error. If this causes issues for you or your environment, you can wrap it in quotes, like in the following code snippet:

createdOn: { type: Date, 'default': Date.now }

Only allowing unique entries

If we want to ensure that there is only ever one user per e-mail address, we can specify that the email field should be unique:

email: {type: String, unique: true}

With this in place, when saving to the database, MongoDB will check to see if the e-mail value already exists in another document. If it finds it, MongoDB (not Mongoose) will return an E11000 error. Note that this approach also defines a MongoDB index on the email field.

Our final User schema

Your userSchema should now look like the following:

var userSchema = new mongoose.Schema({
  name: String,
  email: {type: String, unique: true},
  createdOn: { type: Date, default: Date.now },
  modifiedOn: Date,
  lastLogin: Date
});

A corresponding document from the database would look like the following (line breaks are added for readability):

{
  "__v" : 0,
  "_id" : ObjectId("5126b7a1f8a44d1e32000001"),
  "createdOn" : ISODate("2013-02-22T00:11:13.436Z"),
  "email" : "simon@theholmesoffice.com",
  "lastLogin" : ISODate("2013-04-03T12:54:42.734Z"),
  "modifiedOn" : ISODate("2013-04-03T12:56:26.009Z"),
  "name" : "Simon Holmes"
}

What's that "__v" thing?

You may have noticed a data entity in the document that we didn't set: __v. This is an internal versioning number automatically set by Mongoose when a document is created. It doesn't increment when a document is changed, but instead is automatically incremented whenever an array within the document is updated in such a way that might cause the indexed position of some of the entries to have changed.

Why is this needed? When working with an array you will typically access the individual elements through their positional index, for example, myArray[3]. But what happens if somebody else deletes the element in myArray[2] while you are editing the data in myArray[3]? Your original data is now contained in myArray[2], but you don't know this, so you quite happily overwrite whatever data is now stored in myArray[3]. The __v value gives you a method to sanity check this and prevent this scenario from happening.
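Before moving on to the Project schema, here is a hypothetical sketch (not taken from the book) that exercises the default value and the unique e-mail index just described. It uses the mongoose.model() call that is introduced later in this article, and the connection URL, callback style, and error handling are assumptions for illustration only:

var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/mongoosepm'); // placeholder connection URL

var userSchema = new mongoose.Schema({
  name: String,
  email: {type: String, unique: true},
  createdOn: { type: Date, default: Date.now },
  modifiedOn: Date,
  lastLogin: Date
});
var User = mongoose.model('User', userSchema);

var first = new User({ name: 'Simon', email: 'simon@theholmesoffice.com' });
first.save(function (err, doc) {
  if (err) return console.error(err);
  console.log(doc.createdOn); // the default was applied automatically
  console.log(doc.__v);       // the versioning key set by Mongoose

  // Saving a second user with the same e-mail should surface the E11000
  // duplicate key error from MongoDB in the callback (once the unique
  // index has been built).
  var duplicate = new User({ name: 'Someone else', email: 'simon@theholmesoffice.com' });
  duplicate.save(function (err) {
    if (err) console.error('Duplicate e-mail rejected:', err.code);
  });
});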
Defining the Project schema

As part of our MongoosePM application we also need to think about projects. After all, the PM here does stand for Project Manager. Let's take what we've learned and create the Project schema. We are going to want a few types of data to start with:

- projectName: A string containing the name of the project.
- createdOn: The date when the document was first created and saved. This option is set to automatically save the current date and time.
- modifiedOn: The date and time when the document was last changed.
- createdBy: A string that will for now contain the unique ID of the user who created the project.
- tasks: A string to hold task information.

Transforming these requirements into a Mongoose schema definition, we create the following:

var projectSchema = new mongoose.Schema({
  projectName: String,
  createdOn: Date,
  modifiedOn: { type: Date, default: Date.now },
  createdBy: String,
  tasks: String
});

This is our starting point, and we will build upon it. For now we have these basic data objects as mentioned previously in this article. Here's an example of a corresponding document from the database (line breaks added for readability):

{
  "projectName" : "Another test",
  "createdBy" : "5126b7a1f8a44d1e32000001",
  "createdOn" : ISODate("2013-04-03T17:47:51.031Z"),
  "tasks" : "Just a simple task",
  "_id" : ObjectId("515c6b47596acf8e35000001"),
  "modifiedOn" : ISODate("2013-04-03T17:47:51.032Z"),
  "__v" : 0
}

Improving the Project schema

Throughout the rest of the article we will be improving this schema, but the beauty of using Mongoose is that we can do this relatively easily. Putting together a basic schema like this to build upon is a great approach for prototyping—you have the data you need there, and can add complexity where you need it, when you need it.

Building models

A single instance of a model maps directly to a single document in the database. With this 1:1 relationship, it is the model that handles all document interaction—creating, reading, saving, and deleting. This makes the model a very powerful tool. Building the model is pretty straightforward. When using the default Mongoose connection, we can call the mongoose.model command, passing it two arguments:

- The name of the model
- The name of the schema to compile

So if we were to build a model from our user schema we would use this line:

mongoose.model( 'User', userSchema );

If you're using a named Mongoose connection, the approach is very similar:

adminConnection.model( 'User', userSchema );

Instances

It is useful to have a good understanding of how a model works. After building the User model using the previous line, we could create two instances:

var userOne = new User({ name: 'Simon' });
var userTwo = new User({ name: 'Sally' });

Summary

In this article, we have looked at how schemas and models relate to your data. You should now understand the roles of both schemas and models. We have looked at how to create simple schemas and the types of data they can contain. We have also seen that it is possible to extend this if the native types are not enough. In the MongoosePM project, you should now have added a User schema and a Project schema, and built models of both of these.

Resources for Article:

Further resources on this subject:

- Understanding Express Routes [Article]
- Validating and Using the Model Data [Article]
- Creating Your First Web Page Using ExpressionEngine: Part 1 [Article]

Making a simple cURL request (Simple)

Packt
01 Aug 2013
5 min read
(For more resources related to this topic, see here.) Getting ready In this article we will use cURL to request and download a web page from a server. How to do it... Enter the following code into a new PHP project: <?php // Function to make GET request using cURL function curlGet($url) { $ch = curl_init(); // Initialising cURL session // Setting cURL options curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); curl_setopt($ch, CURLOPT_URL, $url); $results = curl_exec($ch); // Executing cURL session curl_close($ch); // Closing cURL session return $results; // Return the results } $packtPage = curlGet('http://www.packtpub.com/oop-php-5/book'); echo $packtPage; ?> Save the project as 2-curl-request.php (ensure you use the .php extension!). Execute the script. Once our script has completed, we will see the source code of http://www.packtpub.com/oop-php-5/book displayed on the screen. How it works... Let's look at how we performed the previously defined steps: The first line, <?php, and the last line,?>, indicate where our PHP code block will begin and end. All the PHP code should appear between these two tags. Next, we create a function called curlGet(), which accepts a single parameter $url, the URL of the resource to be requested. Running through the code inside the curlGet() function, we start off by initializing a new cURL session as follows: $ch = curl_init(); We then set our options for cURL as follows: curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE); // Tells cURL to return the results of the request (the source code of the target page) as a string. curl_setopt($ch, CURLOPT_URL, $url); // Here we tell cURL the URL we wish to request, notice that it is the $url variable that we passed into the function as a parameter. We execute our cURL request, storing the returned string in the $results variable as follows: $results = curl_exec($ch); Now that the cURL request has been made and we have the results, we close the cURL session by using the following code: curl_close($ch); At the end of the function, we return the $results variable containing our requested page, out of the function for using in our script. return $results; After the function is closed we are able to use it throughout the rest of our script. Later, deciding on the URL we wish to request, http://www.packtpub.com/oop-php-5/book , we execute the function, passing the URL as a parameter and storing the returned data from the function in the $packtPage variable as follows: $packtPage = curlGet('http://www.packtpub.com/oop-php-5/book'); Finally, we echo the contents of the $packtPage variable (the page we requested) to the screen by using the following code: echo $packtPage; There's more... There are a number of different HTTP request methods which indicate the server the desired response, or the action to be performed. The request method being used in this article is cURLs default GET request. This tells the server that we would like to retrieve a resource. Depending on the resource we are requesting, a number of parameters may be passed in the URL. For example, when we perform a search on the Packt Publishing website for a query, say, php, we notice that the URL is http://www.packtpub.com/books?keys=php. This is requesting the resource books (the page that displays search results) and passing a value of php to the keys parameter, indicating that the dynamically generated page should show results for the search query php. More cURL Options Of the many cURL options available, only two have been used in our preceding code. 
They are CURLOPT_RETURNTRANSFER and CURLOPT_URL. Though we will cover many more throughout the course of this article, some other options to be aware of, and that you may wish to try out, are listed in the following table:

Option name: CURLOPT_FAILONERROR
Value: TRUE or FALSE
Purpose: If a response code of 400 or greater is returned, cURL will fail silently.

Option name: CURLOPT_FOLLOWLOCATION
Value: TRUE or FALSE
Purpose: If Location: headers are sent by the server, follow the location.

Option name: CURLOPT_USERAGENT
Value: A user agent string, for example: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1'
Purpose: Sending a user agent string in your request informs the target server which client is requesting the resource. Since many servers will only respond to 'legitimate' requests, it is advisable to include one.

Option name: CURLOPT_HTTPHEADER
Value: An array containing header information, for example: array('Cache-Control: max-age=0', 'Connection: keep-alive', 'Keep-Alive: 300', 'Accept-Language: en-us,en;q=0.5')
Purpose: This option is used to send header information with the request, and we will come across use cases for this in later recipes.

A full listing of cURL options can be found on the PHP website at http://php.net/manual/en/function.curl-setopt.php. The HTTP response code An HTTP response code is the number that is returned, which corresponds to the result of an HTTP request. Some common response code values are as follows: 200: OK 301: Moved Permanently 400: Bad Request 401: Unauthorized 403: Forbidden 404: Not Found 500: Internal Server Error Summary This article covers techniques for making a simple cURL request. It is often useful to have our scrapers respond to different response code values in different ways, for example, letting us know if a web page has moved, is no longer accessible, or we are unauthorized to access a particular page. In this case, we can access the response code of a request using cURL by adding the following line to our function, which will store the response code in the $httpResponse variable: $httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE); Resources for Article: Further resources on this subject: A look into the high-level programming operations for the PHP language [Article] Installing PHP-Nuke [Article] Creating Your Own Theme—A Wordpress Tutorial [Article]
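To tie the recipe together, here is a small, hedged variation of the earlier curlGet() function that switches on two of the options from the table above and returns the response code alongside the page body. The user agent string and target URL shown are only placeholders, not recommendations:

<?php
// Variation of curlGet() returning both the body and the HTTP response code
function curlGetWithInfo($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE); // follow Location: headers
    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (placeholder user agent)');
    $body = curl_exec($ch);
    $httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('code' => $httpResponse, 'body' => $body);
}

$result = curlGetWithInfo('http://www.packtpub.com/oop-php-5/book');
if ($result['code'] === 200) {
    echo $result['body'];
} else {
    echo 'Request returned HTTP ' . $result['code'];
}
?>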

Mining Twitter with Python – Influence and Engagement

Packt
11 Jul 2016
10 min read
In this article by Marco Bonzanini, author of the book Mastering Social Media Mining with Python, we will discussmining Twitter data. Here, we will analyze users, their connections, and their interactions. In this article, we will discuss how to measure influence and engagement on Twitter. (For more resources related to this topic, see here.) Measuring influence and engagement One of the most commonly mentioned characters in the social media arena is the mythical influencer. This figure is responsible for a paradigm shift in the recent marketing strategies (https://en.wikipedia.org/wiki/Influencer_marketing), which focus on targeting key individuals rather than the market as a whole. Influencers are typically active users within their community.In case of Twitter, an influencer tweets a lot about topics they care about. Influencers are well connected as they follow and are followed by many other users who are also involved in the community. In general, an influencer is also regarded as an expert in their area, and is typically trusted by other users. This description should explain why influencers are an important part of recent trends in marketing: an influencer can increase awareness or even become an advocate of a specific product or brand and can reach a vast number of supporters. Whether your main interest is Python programming or wine tasting, regardless how huge (or tiny) your social network is, you probably already have an idea who the influencers in your social circles are: a friend, acquaintance, or random stranger on the Internet whose opinion you trust and value because of their expertise on the given subject. A different, but somehow related, concept is engagement. User engagement, or customer engagement, is the assessment of the response to a particular offer, such as a product or service. In the context of social media, pieces of content are often created with the purpose to drive traffic towards the company website or e-commerce. Measuring engagement is important as it helps in defining and understanding strategies to maximize the interactions with your network, and ultimately bring business. On Twitter, users engage by the means of retweeting or liking a particular tweet, which in return, provides more visibility to the tweet itself. In this section, we'll discuss some interesting aspects of social media analysis regarding the possibility of measuring influence and engagement. On Twitter, a natural thought would be to associate influence with the number of users in a particular network. Intuitively, a high number of followers means that a user can reach more people, but it doesn't tell us how a tweet is perceived. 
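Before walking through the full comparison script that follows, the core idea can be reduced to a couple of ratios. This is only an illustrative sketch with invented numbers, not part of the book's code:

# Toy engagement calculation: all counts here are made up for illustration
followers = 282
favorites = [3, 1, 0, 5, 2]   # favourite counts of some recent tweets
retweets = [10, 4, 1, 7, 3]   # retweet counts of the same tweets

engagement_per_tweet = (sum(favorites) + sum(retweets)) / len(favorites)
engagement_per_follower = (sum(favorites) + sum(retweets)) / followers

print("Engagement per tweet: {:.2f}".format(engagement_per_tweet))
print("Engagement per follower: {:.2f}".format(engagement_per_follower))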
The following script compares some statistics for two user profiles: import sys import json   def usage():   print("Usage:")   print("python {} <username1><username2>".format(sys.argv[0]))   if __name__ == '__main__':   if len(sys.argv) != 3:     usage()     sys.exit(1)   screen_name1 = sys.argv[1]   screen_name2 = sys.argv[2] After reading the two screen names from the command line, we will build up a list of followersfor each of them, including their number of followers to calculate the number of reachable users: followers_file1 = 'users/{}/followers.jsonl'.format(screen_name1)   followers_file2 = 'users/{}/followers.jsonl'.format(screen_name2)   with open(followers_file1) as f1, open(followers_file2) as f2:     reach1 = []     reach2 = []     for line in f1:       profile = json.loads(line)       reach1.append((profile['screen_name'], profile['followers_count']))     for line in f2:       profile = json.loads(line)       reach2.append((profile['screen_name'],profile['followers_count'])) We will then load some basic statistics (followers and statuses count) from the two user profiles: profile_file1 = 'users/{}/user_profile.json'.format(screen_name1)   profile_file2 = 'users/{}/user_profile.json'.format(screen_name2)   with open(profile_file1) as f1, open(profile_file2) as f2:     profile1 = json.load(f1)     profile2 = json.load(f2)     followers1 = profile1['followers_count']     followers2 = profile2['followers_count']     tweets1 = profile1['statuses_count']     tweets2 = profile2['statuses_count']     sum_reach1 = sum([x[1] for x in reach1])   sum_reach2 = sum([x[1] for x in reach2])   avg_followers1 = round(sum_reach1 / followers1, 2)   avg_followers2 = round(sum_reach2 / followers2, 2) We will also load the timelines for the two users, in particular, to observe the number of times their tweets have been favorited or retweeted: timeline_file1 = 'user_timeline_{}.jsonl'.format(screen_name1)   timeline_file2 = 'user_timeline_{}.jsonl'.format(screen_name2)   with open(timeline_file1) as f1, open(timeline_file2) as f2:     favorite_count1, retweet_count1 = [], []     favorite_count2, retweet_count2 = [], []     for line in f1:       tweet = json.loads(line)       favorite_count1.append(tweet['favorite_count'])       retweet_count1.append(tweet['retweet_count'])     for line in f2:       tweet = json.loads(line)       favorite_count2.append(tweet['favorite_count'])       retweet_count2.append(tweet['retweet_count']) The preceding numbers are then aggregated into average number of favorites and average number of retweets, both in absolute terms and per number of followers: avg_favorite1 = round(sum(favorite_count1) / tweets1, 2)   avg_favorite2 = round(sum(favorite_count2) / tweets2, 2)   avg_retweet1 = round(sum(retweet_count1) / tweets1, 2)   avg_retweet2 = round(sum(retweet_count2) / tweets2, 2)   favorite_per_user1 = round(sum(favorite_count1) / followers1, 2)   favorite_per_user2 = round(sum(favorite_count2) / followers2, 2)   retweet_per_user1 = round(sum(retweet_count1) / followers1, 2)   retweet_per_user2 = round(sum(retweet_count2) / followers2, 2)   print("----- Stats {} -----".format(screen_name1))   print("{} followers".format(followers1))   print("{} users reached by 1-degree connections".format(sum_reach1))   print("Average number of followers for {}'s followers: {}".format(screen_name1, avg_followers1))   print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count1), avg_favorite1, favorite_per_user1))   print("Retweeted {} times ({} per tweet, {} per 
user)".format(sum(retweet_count1), avg_retweet1, retweet_per_user1))   print("----- Stats {} -----".format(screen_name2))   print("{} followers".format(followers2))   print("{} users reached by 1-degree connections".format(sum_reach2))   print("Average number of followers for {}'s followers: {}".format(screen_name2, avg_followers2))   print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count2), avg_favorite2, favorite_per_user2))   print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count2), avg_retweet2, retweet_per_user2)) This script takes two arguments from the command line and assumes that the data has already been downloaded. In particular, for both users, we need the data about followers and the respective user timelines. The script is somehow verbose, because it computes the same operations for two profiles and prints everything on the terminal. We can break it down into different parts. Firstly, we will look into the followers' followers. This will provide some information related to the part of the network immediately connected to the given user. In other words, it should answer the question how many users can I reach if all my followers retweet me? We can achieve this by reading the users/<user>/followers.jsonl file and keeping a list of tuples, where each tuple represents one of the followers and is in the (screen_name, followers_count)form. Keeping the screen name at this stage is useful in case we want to observe who the users with the highest number of followers are (not computed in the script, but easy to produce using sorted()). In the second step, we will read the user profile from the users/<user>/user_profile.jsonfile so that we can get information about the total number of followers and the total number of tweets. With the data collected so far, we can compute the total number of users who are reachable within a degree of separation (follower of a follower) and the average number of followers of a follower. This is achieved via the following lines: sum_reach1 = sum([x[1] for x in reach1]) avg_followers1 = round(sum_reach1 / followers1, 2) The first one uses a list comprehension to iterate through the list of tuples mentioned previously, while the second one is a simple arithmetic average, rounded to two decimal points. The third part of the script reads the user timeline from the user_timeline_<user>.jsonlfile and collects information about the number of retweets and favorite for each tweet. Putting everything together allows us to calculate how many times a user has been retweeted or favorited and what is the average number of retweet/favorite per tweet and follower. 
To provide an example, I'll perform some vanity analysis and compare my account,@marcobonzanini, with Packt Publishing: $ python twitter_influence.py marcobonzanini PacktPub The script produces the following output: ----- Stats marcobonzanini ----- 282 followers 1411136 users reached by 1-degree connections Average number of followers for marcobonzanini's followers: 5004.03 Favorited 268 times (1.47 per tweet, 0.95 per user) Retweeted 912 times (5.01 per tweet, 3.23 per user) ----- Stats PacktPub ----- 10209 followers 29961760 users reached by 1-degree connections Average number of followers for PacktPub's followers: 2934.84 Favorited 3554 times (0.33 per tweet, 0.35 per user) Retweeted 6434 times (0.6 per tweet, 0.63 per user) As you can see, the raw number of followers shows no contest, with Packt Publishing having approximatively 35 times more followers than me. The interesting part of this analysis comes up when we compare the average number of retweets and favorites, apparently my followers are much more engaged with my content than PacktPub's. Is this enough to declare than I'm an influencer while PacktPub is not? Clearly not. What we observe here is a natural consequence of the fact that my tweets are probably more focused on specific topics (Python and data science), hence my followers are already more interested in what I'm publishing. On the other side, the content produced by Packt Publishing is highly diverse as it ranges across many different technologies. This diversity is also reflected in PacktPub's followers, who include developers, designers, scientists, system administrator, and so on. For this reason, each of PacktPub's tweet is found interesting (that is worth retweeting) by a smaller proportion of their followers. Summary In this article,we discussed mining data from Twitter by focusing on the analysis of user connections and interactions. In particular, we discussed how to compare influence and engagement between users. For more information on social media mining, refer the following books by Packt Publishing: Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/social-media-mining-r Mastering Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-r Further resources on this subject: Probabilistic Graphical Models in R [article] Machine Learning Tasks [article] Support Vector Machines as a Classification Engine [article]

Training neural networks efficiently using Keras

Packt
22 Feb 2016
9 min read
In this article, we will take a look at Keras, one of the most recently developed libraries to facilitate neural network training. The development on Keras started in the early months of 2015; as of today, it has evolved into one of the most popular and widely used libraries that are built on top of Theano, and allows us to utilize our GPU to accelerate neural network training. One of its prominent features is that it's a very intuitive API, which allows us to implement neural networks in only a few lines of code. Once you have Theano installed, you can install Keras from PyPI by executing the following command from your terminal command line: (For more resources related to this topic, see here.) pip install Keras For more information about Keras, please visit the official website at http://keras.io. To see what neural network training via Keras looks like, let's implement a multilayer perceptron to classify the handwritten digits from the MNIST dataset. The MNIST dataset can be downloaded from http://yann.lecun.com/exdb/mnist/ in four parts as listed here: train-images-idx3-ubyte.gz: These are training set images (9912422 bytes) train-labels-idx1-ubyte.gz: These are training set labels (28881 bytes) t10k-images-idx3-ubyte.gz: These are test set images (1648877 bytes) t10k-labels-idx1-ubyte.gz: These are test set labels (4542 bytes) After downloading and unzipped the archives, we place the files into a directory mnist in our current working directory, so that we can load the training as well as the test dataset using the following function: import os import struct import numpy as np def load_mnist(path, kind='train'): """Load MNIST data from `path`""" labels_path = os.path.join(path, '%s-labels-idx1-ubyte' % kind) images_path = os.path.join(path, '%s-images-idx3-ubyte' % kind) with open(labels_path, 'rb') as lbpath: magic, n = struct.unpack('>II', lbpath.read(8)) labels = np.fromfile(lbpath, dtype=np.uint8) with open(images_path, 'rb') as imgpath: magic, num, rows, cols = struct.unpack(">IIII", imgpath.read(16)) images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784) return images, labels X_train, y_train = load_mnist('mnist', kind='train') print('Rows: %d, columns: %d' % (X_train.shape[0], X_train.shape[1])) Rows: 60000, columns: 784 X_test, y_test = load_mnist('mnist', kind='t10k') print('Rows: %d, columns: %d' % (X_test.shape[0], X_test.shape[1])) Rows: 10000, columns: 784 On the following pages, we will walk through the code examples for using Keras step by step, which you can directly execute from your Python interpreter. However, if you are interested in training the neural network on your GPU, you can either put it into a Python script, or download the respective code from the Packt Publishing website. In order to run the Python script on your GPU, execute the following command from the directory where the mnist_keras_mlp.py file is located: THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python mnist_keras_mlp.py To continue with the preparation of the training data, let's cast the MNIST image array into 32-bit format: >>> import theano >>> theano.config.floatX = 'float32' >>> X_train = X_train.astype(theano.config.floatX) >>> X_test = X_test.astype(theano.config.floatX) Next, we need to convert the class labels (integers 0-9) into the one-hot format. 
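Before reaching for Keras' helper (shown next), it can be useful to see that one-hot encoding is nothing more than an indicator matrix. Here is a hand-rolled NumPy sketch, purely for illustration:

import numpy as np

labels = np.array([5, 0, 4])                       # the first three MNIST labels
one_hot = np.zeros((labels.shape[0], 10), dtype='float32')
one_hot[np.arange(labels.shape[0]), labels] = 1.0  # set a single 1 per row, in the label's column
print(one_hot)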
Fortunately, Keras provides a convenient tool for this: >>> from keras.utils import np_utils >>> print('First 3 labels: ', y_train[:3]) First 3 labels: [5 0 4] >>> y_train_ohe = np_utils.to_categorical(y_train) >>> print('nFirst 3 labels (one-hot):n', y_train_ohe[:3]) First 3 labels (one-hot): [[ 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.] [ 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.] [ 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]] Now, we can get to the interesting part and implement a neural network. However, we will replace the logistic units in the hidden layer with hyperbolic tangent activation functions, replace the logistic function in the output layer with softmax, and add an additional hidden layer. Keras makes these tasks very simple, as you can see in the following code implementation: >>> from keras.models import Sequential >>> from keras.layers.core import Dense >>> from keras.optimizers import SGD >>> np.random.seed(1) >>> model = Sequential() >>> model.add(Dense(input_dim=X_train.shape[1], ... output_dim=50, ... init='uniform', ... activation='tanh')) >>> model.add(Dense(input_dim=50, ... output_dim=50, ... init='uniform', ... activation='tanh')) >>> model.add(Dense(input_dim=50, ... output_dim=y_train_ohe.shape[1], ... init='uniform', ... activation='softmax')) >>> sgd = SGD(lr=0.001, decay=1e-7, momentum=.9) >>> model.compile(loss='categorical_crossentropy', optimizer=sgd) First, we initialize a new model using the Sequential class to implement a feedforward neural network. Then, we can add as many layers to it as we like. However, since the first layer that we add is the input layer, we have to make sure that the input_dim attribute matches the number of features (columns) in the training set (here, 768). Also, we have to make sure that the number of output units (output_dim) and input units (input_dim) of two consecutive layers match. In the preceding example, we added two hidden layers with 50 hidden units plus 1 bias unit each. Note that bias units are initialized to 0 in fully connected networks in Keras. This is in contrast to the MLP implementation, where we initialized the bias units to 1, which is a more common (not necessarily better) convention. Finally, the number of units in the output layer should be equal to the number of unique class labels—the number of columns in the one-hot encoded class label array. Before we can compile our model, we also have to define an optimizer. In the preceding example, we chose a stochastic gradient descent optimization. Furthermore, we can set values for the weight decay constant and momentum learning to adjust the learning rate at each epoch. Lastly, we set the cost (or loss) function to categorical_crossentropy. The (binary) cross-entropy is just the technical term for the cost function in logistic regression, and the categorical cross-entropy is its generalization for multi-class predictions via softmax. After compiling the model, we can now train it by calling the fit method. Here, we are using mini-batch stochastic gradient with a batch size of 300 training samples per batch. We train the MLP over 50 epochs, and we can follow the optimization of the cost function during training by setting verbose=1. The validation_split parameter is especially handy, since it will reserve 10 percent of the training data (here, 6,000 samples) for validation after each epoch, so that we can check if the model is overfitting during training. >>> model.fit(X_train, ... y_train_ohe, ... nb_epoch=50, ... batch_size=300, ... verbose=1, ... validation_split=0.1, ... 
show_accuracy=True) Train on 54000 samples, validate on 6000 samples Epoch 0 54000/54000 [==============================] - 1s - loss: 2.2290 - acc: 0.3592 - val_loss: 2.1094 - val_acc: 0.5342 Epoch 1 54000/54000 [==============================] - 1s - loss: 1.8850 - acc: 0.5279 - val_loss: 1.6098 - val_acc: 0.5617 Epoch 2 54000/54000 [==============================] - 1s - loss: 1.3903 - acc: 0.5884 - val_loss: 1.1666 - val_acc: 0.6707 Epoch 3 54000/54000 [==============================] - 1s - loss: 1.0592 - acc: 0.6936 - val_loss: 0.8961 - val_acc: 0.7615 […] Epoch 49 54000/54000 [==============================] - 1s - loss: 0.1907 - acc: 0.9432 - val_loss: 0.1749 - val_acc: 0.9482 Printing the value of the cost function is extremely useful during training, since we can quickly spot whether the cost is decreasing during training and stop the algorithm earlier if otherwise to tune the hyperparameters values. To predict the class labels, we can then use the predict_classes method to return the class labels directly as integers: >>> y_train_pred = model.predict_classes(X_train, verbose=0) >>> print('First 3 predictions: ', y_train_pred[:3]) >>> First 3 predictions: [5 0 4] Finally, let's print the model accuracy on training and test sets: >>> train_acc = np.sum( ... y_train == y_train_pred, axis=0) / X_train.shape[0] >>> print('Training accuracy: %.2f%%' % (train_acc * 100)) Training accuracy: 94.51% >>> y_test_pred = model.predict_classes(X_test, verbose=0) >>> test_acc = np.sum(y_test == y_test_pred, ... axis=0) / X_test.shape[0] print('Test accuracy: %.2f%%' % (test_acc * 100)) Test accuracy: 94.39% Note that this is just a very simple neural network without optimized tuning parameters. If you are interested in playing more with Keras, please feel free to further tweak the learning rate, momentum, weight decay, and number of hidden units. Although Keras is great library for implementing and experimenting with neural networks, there are many other Theano wrapper libraries that are worth mentioning. A prominent example is Pylearn2 (http://deeplearning.net/software/pylearn2/), which has been developed in the LISA lab in Montreal. Also, Lasagne (https://github.com/Lasagne/Lasagne) may be of interest to you if you prefer a more minimalistic but extensible library, that offers more control over the underlying Theano code. Summary We caught a glimpse of the most beautiful and most exciting algorithms in the whole machine learning field: artificial neural networks. I can recommend you to follow the works of the leading experts in this field, such as Geoff Hinton (http://www.cs.toronto.edu/~hinton/), Andrew Ng (http://www.andrewng.org), Yann LeCun (http://yann.lecun.com), Juergen Schmidhuber (http://people.idsia.ch/~juergen/), and Yoshua Bengio (http://www.iro.umontreal.ca/~bengioy), just to name a few. To learn more about material design, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended: Building Machine Learning Systems with Python (https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python) Neural Network Programming with Java (https://www.packtpub.com/networking-and-servers/neural-network-programming-java) Resources for Article: Further resources on this subject: Python Data Analysis Utilities [article] Machine learning and Python – the Dream Team [article] Adding a Spark to R [article]
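As a small follow-up that is not part of the original walkthrough: the overall accuracy above can hide how the network performs on individual digits. Using only NumPy and the y_test and y_test_pred arrays computed earlier, a per-class breakdown might look like this:

import numpy as np

for digit in range(10):
    mask = (y_test == digit)
    class_acc = np.mean(y_test_pred[mask] == y_test[mask])
    print('Digit %d: %.2f%% accuracy over %d test samples'
          % (digit, class_acc * 100, mask.sum()))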

LabVIEW Basics

Packt
02 Nov 2016
8 min read
In this article by Behzad Ehsani, author of the book Data Acquisition using LabVIEW, after a brief introduction and a short note on installation, we will go over the most widely used palettes and objects in the icon toolbar of a standard installation of LabVIEW, with a brief explanation of what each object does. (For more resources related to this topic, see here.)

Introduction to LabVIEW

LabVIEW is a graphical development and testing environment unlike any other test and development tool available in the industry. LabVIEW sets itself apart from traditional programming environments by its completely graphical approach to programming. As an example, while the representation of a while loop in a text-based language such as C consists of several predefined, extremely compact, and sometimes extremely cryptic lines of text, a while loop in LabVIEW is actually a graphical loop. The environment is extremely intuitive and powerful, which makes for a short learning curve for the beginner. LabVIEW is based on what is called the G language, but there are still other languages, especially C, under the hood. However, the ease of use and power of LabVIEW can be somewhat deceiving to a novice user. Many people have attempted to start projects in LabVIEW only because, at first glance, the graphical nature of the interface and the drag-and-drop approach used in LabVIEW appear to do away with the required basics of programming concepts and a classical education in programming science and engineering. This is far from the reality of using LabVIEW as the predominant development environment. While it is true that in many higher-level development and testing environments, especially when using complicated test equipment and complex mathematical calculations, or even when creating embedded software, LabVIEW's approach will be much more time-efficient and bug-free than one that would otherwise require many lines of code in a traditional text-based programming environment, one must be aware of LabVIEW's strengths and possible weaknesses. LabVIEW does not completely replace the need for traditional text-based languages, and depending on the nature of a project, LabVIEW or a traditional text-based language such as C may be the most suitable programming or test environment.

Installing LabVIEW

Installation of LabVIEW is very simple and just as routine as any modern-day program installation; that is, insert DVD 1 and follow the on-screen guided installation steps. LabVIEW comes on one DVD for the Mac and Linux versions but on four or more DVDs for the Windows edition (depending on additional software, different licensing, and additional libraries and packages purchased). In this article we will use the LabVIEW 2013 Professional Development version for Windows. Given the target audience of this article, we assume the user is capable of installing the program. Installation is also well documented by National Instruments, and the mandatory one-year support purchase with each copy of LabVIEW is a valuable source of live and email help. Also, the NI website (www.ni.com) has many user support groups that are a great source of support, example code, discussion groups, and local group events and meetings of fellow LabVIEW developers. One worthy note for those who are new to installing LabVIEW is that the installation DVDs include much more than what an average user would need and pay for. We do strongly suggest that you install additional software (beyond what has been purchased and licensed or is immediately needed).
These additional software packages are fully functional (in demo mode for 7 days), which may be extended for about a month with online registration. This is a very good opportunity to get hands-on experience with even more of the power and functionality that LabVIEW is capable of offering. The additional information gained by installing the other software available on the DVDs may help in the further development of a given project. Just imagine: even if the current development of a robot only encompasses mechanical movements and sensors today, optical recognition is probably going to follow sooner than one may think. If data acquisition using expensive hardware and software is only possible in one location, the need for web sharing and remote control of the setup is just around the corner. It is very helpful to at least be aware of what packages are currently available and to be able to install and test them prior to a full purchase and implementation. The following screenshot shows what may be installed if almost all the software on all the DVDs is selected: When installing a fresh version of LabVIEW, if you do decide to follow the advice above, make sure to click on the + sign next to each package you decide to install and prevent any installation of LabWindows/CVI... and Measurement Studio... for Visual Studio. LabWindows, according to National Instruments, is an ANSI C integrated development environment. Also note that by default NI device drivers are not selected to be installed. Device drivers are an essential part of any data acquisition setup, and appropriate drivers for communication and instrument control must be installed before LabVIEW can interact with external equipment. Also note that device drivers (on Windows installations) come on a separate DVD, which means that one does not have to install device drivers at the same time that the main application and other modules are installed; they can be installed at any time later on. Almost all well-established vendors package their products with LabVIEW drivers and example code. If a driver is not readily available, National Instruments has programmers who can write one, but this will come at a cost to the user. VI Package Manager, now installed as part of the standard installation, is also a must these days. National Instruments distributes third-party software, drivers, and public domain packages via VI Package Manager; appropriate software and drivers for supported microcontrollers are installed the same way. You can install many public domain packages that add useful LabVIEW toolkits to a LabVIEW installation and can be used just like those delivered professionally by National Instruments. Finally, note that the more modules, packages, and software are selected to be installed, the longer it will take to complete the installation. This may sound like an obvious point, but surprisingly enough, installing all the software on the three DVDs (for Windows) took over five hours on the standard laptop PC we used. Obviously, a more powerful PC (such as one with a solid state drive) may not take such a long time.

LabVIEW Basics

Once the LabVIEW application is launched, by default two blank windows open simultaneously, a Front Panel and a Block Diagram window, and a VI is created:
A VI may consist of only a few objects or of hundreds of objects embedded in many subVIs. Everything, be it a simple while loop, a complex mathematical concept such as polynomial interpolation, or simply a Boolean constant, is represented graphically. To use an object, right-click inside the Block Diagram or Front Panel window and a palette list appears. Follow the arrow and pick an object from the subsequent palette, then place it on the appropriate window. The selected object can now be dragged to a different location on the appropriate window and is ready to be wired. Depending on what kind of object is selected, a graphical representation of the object appears on both windows. Of course, there are many exceptions to this rule. For example, a while loop can only be selected in the Block Diagram and, by itself, a while loop does not have a graphical representation on the Front Panel window. Needless to say, LabVIEW also has keyboard combinations that expedite selecting and placing any given toolkit object onto the appropriate window. Each object has one (or several) wire connections going in as its input(s) and coming out as its output(s). A VI becomes functional when a minimum number of wires are appropriately connected to the inputs and outputs of one or more objects. Later, we will use an example to illustrate how a basic LabVIEW VI is created and executed.

Highlights

LabVIEW is a complete object-oriented development and test environment based on the G language. As such, it is a very powerful and complex environment. In this article we went through an introduction to LabVIEW and the main functionality of each of its toolbar icons by way of an actual interactive user example. Accompanied by appropriate hardware (both NI products and many industry-standard test, measurement, and development hardware products), LabVIEW is capable of covering everything from developing embedded systems to fuzzy logic and almost everything in between!

Summary

In this article we covered the basics of LabVIEW, from installation to an in-depth explanation of each element in the toolbar. Resources for Article: Further resources on this subject: Python Data Analysis Utilities [article] Data mining [article] PostgreSQL in Action [article]

Working with Incanter Datasets

Packt
04 Feb 2015
28 min read
In this article by Eric Rochester author of the book, Clojure Data Analysis Cookbook, Second Edition, we will cover the following recipes: Loading Incanter's sample datasets Loading Clojure data structures into datasets Viewing datasets interactively with view Converting datasets to matrices Using infix formulas in Incanter Selecting columns with $ Selecting rows with $ Filtering datasets with $where Grouping data with $group-by Saving datasets to CSV and JSON Projecting from multiple datasets with $join (For more resources related to this topic, see here.) Introduction Incanter combines the power to do statistics using a fully-featured statistical language such as R (http://www.r-project.org/) with the ease and joy of Clojure. Incanter's core data structure is the dataset, so we'll spend some time in this article to look at how to use them effectively. While learning basic tools in this manner is often not the most exciting way to spend your time, it can still be incredibly useful. At its most fundamental level, an Incanter dataset is a table of rows. Each row has the same set of columns, much like a spreadsheet. The data in each cell of an Incanter dataset can be a string or a numeric. However, some operations require the data to only be numeric. First you'll learn how to populate and view datasets, then you'll learn different ways to query and project the parts of the dataset that you're interested in onto a new dataset. Finally, we'll take a look at how to save datasets and merge multiple datasets together. Loading Incanter's sample datasets Incanter comes with a set of default datasets that are useful for exploring Incanter's functions. I haven't made use of them in this book, since there is so much data available in other places, but they're a great way to get a feel of what you can do with Incanter. Some of these datasets—for instance, the Iris dataset—are widely used to teach and test statistical algorithms. It contains the species and petal and sepal dimensions for 50 irises. This is the dataset that we'll access today. In this recipe, we'll load a dataset and see what it contains. Getting ready We'll need to include Incanter in our Leiningen project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include the right Incanter namespaces into our script or REPL: (use '(incanter core datasets)) How to do it… Once the namespaces are available, we can access the datasets easily: user=> (def iris (get-dataset :iris))#'user/iris user=> (col-names iris)[:Sepal.Length :Sepal.Width :Petal.Length :Petal.Width :Species]user=> (nrow iris)150 user=> (set ($ :Species iris))#{"versicolor" "virginica" "setosa"} How it works… We use the get-dataset function to access the built-in datasets. In this case, we're loading the Fisher's Iris dataset, sometimes called Anderson's dataset. This is a multivariate dataset for discriminant analysis. It gives petal and sepal measurements for 150 different Irises of three different species. Incanter's sample datasets cover a wide variety of topics—from U.S. arrests to plant growth and ultrasonic calibration. They can be used to test different algorithms and analyses and to work with different types of data. By the way, the names of functions should be familiar to you if you've previously used R. Incanter often uses the names of R's functions instead of using the Clojure names for the same functions. For example, the preceding code sample used nrow instead of count. 
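Before moving on, here is a quick, hedged sketch of the kind of question this sample data can answer. It borrows $ and $where, which are covered in their own recipes later in this article, and incanter.stats for the mean; the numeric result is only approximate:

(use '(incanter core stats datasets))

(def iris (get-dataset :iris))

;; mean petal length for a single species
(def setosa ($where {:Species "setosa"} iris))
(mean ($ :Petal.Length setosa))
;; => roughly 1.46 for this sample dataset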
There's more... Incanter's API documentation for get-dataset (http://liebke.github.com/incanter/datasets-api.html#incanter.datasets/get-dataset) lists more sample datasets, and you can refer to it for the latest information about the data that Incanter bundles. Loading Clojure data structures into datasets While they are good for learning, Incanter's built-in datasets probably won't be that useful for your work (unless you work with irises). Other recipes cover ways to get data from CSV files and other sources into Incanter. Incanter also accepts native Clojure data structures in a number of formats. We'll take look at a couple of these in this recipe. Getting ready We'll just need Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll also need to include this in our script or REPL: (use 'incanter.core) How to do it… The primary function used to convert data into a dataset is to-dataset. While it can convert single, scalar values into a dataset, we'll start with slightly more complicated inputs. Generally, you'll be working with at least a matrix. If you pass this to to-dataset, what do you get? user=> (def matrix-set (to-dataset [[1 2 3] [4 5 6]]))#'user/matrix-set user=> (nrow matrix-set)2user=> (col-names matrix-set)[:col-0 :col-1 :col-2] All the data's here, but it can be labeled in a better way. Does to-dataset handle maps? user=> (def map-set (to-dataset {:a 1, :b 2, :c 3}))#'user/map-set user=> (nrow map-set)1 user=> (col-names map-set)[:a :c :b] So, map keys become the column labels. That's much more intuitive. Let's throw a sequence of maps at it: user=> (def maps-set (to-dataset [{:a 1, :b 2, :c 3},                                 {:a 4, :b 5, :c 6}]))#'user/maps-setuser=> (nrow maps-set)2user=> (col-names maps-set)[:a :c :b] This is much more useful. We can also create a dataset by passing the column vector and the row matrix separately to dataset: user=> (def matrix-set-2         (dataset [:a :b :c]                         [[1 2 3] [4 5 6]]))#'user/matrix-set-2 user=> (nrow matrix-set-2)2 user=> (col-names matrix-set-2)[:c :b :a] How it works… The to-dataset function looks at the input and tries to process it intelligently. If given a sequence of maps, the column names are taken from the keys of the first map in the sequence. Ultimately, it uses the dataset constructor to create the dataset. When you want the most control, you should also use the dataset. It requires the dataset to be passed in as a column vector and a row matrix. When the data is in this format or when we need the most control—to rename the columns, for instance—we can use dataset. Viewing datasets interactively with view Being able to interact with our data programmatically is important, but sometimes it's also helpful to be able to look at it. This can be especially useful when you do data exploration. Getting ready We'll need to have Incanter in our project.clj file and script or REPL, so we'll use the same setup as we did for the Loading Incanter's sample datasets recipe, as follows. We'll also use the Iris dataset from that recipe. (use '(incanter core datasets)) How to do it… Incanter makes this very easy. 
Let's take a look at just how simple it is: First, we need to load the dataset, as follows: user=> (def iris (get-dataset :iris)) #'user/iris Then we just call view on the dataset: user=> (view iris) This function returns the Swing window frame, which contains our data, as shown in the following screenshot. This window should also be open on your desktop, although for me, it's usually hiding behind another window: How it works… Incanter's view function takes any object and tries to display it graphically. In this case, it simply displays the raw data as a table. Converting datasets to matrices Although datasets are often convenient, many times we'll want to treat our data as a matrix from linear algebra. In Incanter, matrices store a table of doubles. This provides good performance in a compact data structure. Moreover, we'll need matrices many times because some of Incanter's functions, such as trans, only operate on a matrix. Plus, it implements Clojure's ISeq interface, so interacting with matrices is also convenient. Getting ready For this recipe, we'll need the Incanter libraries, so we'll use this project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]]) We'll use the core and io namespaces, so we'll load these into our script or REPL: (use '(incanter core io)) This line binds the file name to the identifier data-file: (def data-file "data/all_160_in_51.P35.csv") How to do it… For this recipe, we'll create a dataset, convert it to a matrix, and then perform some operations on it: First, we need to read the data into a dataset, as follows: (def va-data (read-dataset data-file :header true)) Then, in order to convert it to a matrix, we just pass it to the to-matrix function. Before we do this, we'll pull out a few of the columns since matrixes can only contain floating-point numbers: (def va-matrix    (to-matrix ($ [:POP100 :HU100 :P035001] va-data))) Now that it's a matrix, we can treat it like a sequence of rows. Here, we pass it to first in order to get the first row, take in order to get a subset of the matrix, and count in order to get the number of rows in the matrix: user=> (first va-matrix) A 1x3 matrix ------------- 8.19e+03 4.27e+03 2.06e+03   user=> (count va-matrix) 591 We can also use Incanter's matrix operators to get the sum of each column, for instance. The plus function takes each row and sums each column separately: user=> (reduce plus va-matrix) A 1x3 matrix ------------- 5.43e+06 2.26e+06 1.33e+06 How it works… The to-matrix function takes a dataset of floating-point values and returns a compact matrix. Matrices are used by many of Incanter's more sophisticated analysis functions, as they're easy to work with. There's more… In this recipe, we saw the plus matrix operator. Incanter defines a full suite of these. You can learn more about matrices and see what operators are available at https://github.com/liebke/incanter/wiki/matrices. Using infix formulas in Incanter There's a lot to like about lisp: macros, the simple syntax, and the rapid development cycle. Most of the time, it is fine if you treat math operators as functions and use prefix notations, which is a consistent, function-first syntax. This allows you to treat math operators in the same way as everything else so that you can pass them to reduce, or anything else you want to do. However, we're not taught to read math expressions using prefix notations (with the operator first). 
And especially when formulas get even a little complicated, tracing out exactly what's happening can get hairy. Getting ready For this recipe we'll just need Incanter in our project.clj file, so we'll use the dependencies statement—as well as the use statement—from the Loading Clojure data structures into datasets recipe. For data, we'll use the matrix that we created in the Converting datasets to matrices recipe. How to do it… Incanter has a macro that converts a standard math notation to a lisp notation. We'll explore that in this recipe: The $= macro changes its contents to use an infix notation, which is what we're used to from math class: user=> ($= 7 * 4)28user=> ($= 7 * 4 + 3)31 We can also work on whole matrixes or just parts of matrixes. In this example, we perform a scalar multiplication of the matrix: user=> ($= va-matrix * 4)A 591x3 matrix---------------3.28e+04 1.71e+04 8.22e+03 2.08e+03 9.16e+02 4.68e+02 1.19e+03 6.52e+02 3.08e+02...1.41e+03 7.32e+02 3.72e+02 1.31e+04 6.64e+03 3.49e+03 3.02e+04 9.60e+03 6.90e+03 user=> ($= (first va-matrix) * 4)A 1x3 matrix-------------3.28e+04 1.71e+04 8.22e+03 Using this, we can build complex expressions, such as this expression that takes the mean of the values in the first row of the matrix: user=> ($= (sum (first va-matrix)) /           (count (first va-matrix)))4839.333333333333 Or we can build expressions take the mean of each column, as follows: user=> ($= (reduce plus va-matrix) / (count va-matrix))A 1x3 matrix-------------9.19e+03 3.83e+03 2.25e+03 How it works… Any time you're working with macros and you wonder how they work, you can always get at their output expressions easily, so you can see what the computer is actually executing. The tool to do this is macroexpand-1. This expands the macro one step and returns the result. It's sibling function, macroexpand, expands the expression until there is no macro expression left. Usually, this is more than we want, so we just use macroexpand-1. Let's see what these macros expand into: user=> (macroexpand-1 '($= 7 * 4))(incanter.core/mult 7 4)user=> (macroexpand-1 '($= 7 * 4 + 3))(incanter.core/plus (incanter.core/mult 7 4) 3)user=> (macroexpand-1 '($= 3 + 7 * 4))(incanter.core/plus 3 (incanter.core/mult 7 4)) Here, we can see that the expression doesn't expand into Clojure's * or + functions, but it uses Incanter's matrix functions, mult and plus, instead. This allows it to handle a variety of input types, including matrices, intelligently. Otherwise, it switches around the expressions the way we'd expect. Also, we can see by comparing the last two lines of code that it even handles operator precedence correctly. Selecting columns with $ Often, you need to cut the data to make it more useful. One common transformation is to pull out all the values from one or more columns into a new dataset. This can be useful for generating summary statistics or aggregating the values of some columns. The Incanter macro $ slices out parts of a dataset. In this recipe, we'll see this in action. Getting ready For this recipe, we'll need to have Incanter listed in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                [org.clojure/data.csv "0.1.2"]]) We'll also need to include these libraries in our script or REPL: (require '[clojure.java.io :as io]         '[clojure.data.csv :as csv]         '[clojure.string :as str]         '[incanter.core :as i]) Moreover, we'll need some data. 
This time, we'll use some country data from the World Bank. Point your browser to http://data.worldbank.org/country and select a country. I picked China. Under World Development Indicators, there is a button labeled Download Data. Click on this button and select CSV. This will download a ZIP file. I extracted its contents into the data/chn directory in my project. I bound the filename for the primary data file to the data-file name. How to do it… We'll use the $ macro in several different ways to get different results. First, however, we'll need to load the data into a dataset, which we'll do in steps 1 and 2: Before we start, we'll need a couple of utilities that load the data file into a sequence of maps and makes a dataset out of those: (defn with-header [coll] (let [headers (map #(keyword (str/replace % space -))                      (first coll))]    (map (partial zipmap headers) (next coll))))   (defn read-country-data [filename] (with-open [r (io/reader filename)]    (i/to-dataset      (doall (with-header                (drop 2 (csv/read-csv r))))))) Now, using these functions, we can load the data: user=> (def chn-data (read-country-data data-file)) We can select columns to be pulled out from the dataset by passing the column names or numbers to the $ macro. It returns a sequence of the values in the column: user=> (i/$ :Indicator-Code chn-data) ("AG.AGR.TRAC.NO" "AG.CON.FERT.PT.ZS" "AG.CON.FERT.ZS" … We can select more than one column by listing all of them in a vector. This time, the results are in a dataset: user=> (i/$ [:Indicator-Code :1992] chn-data)   |           :Indicator-Code |               :1992 | |---------------------------+---------------------| |           AG.AGR.TRAC.NO |             770629 | |         AG.CON.FERT.PT.ZS |                     | |           AG.CON.FERT.ZS |                     | |           AG.LND.AGRI.K2 |             5159980 | … We can list as many columns as we want, although the formatting might suffer: user=> (i/$ [:Indicator-Code :1992 :2002] chn-data)   |           :Indicator-Code |               :1992 |               :2002 | |---------------------------+---------------------+---------------------| |           AG.AGR.TRAC.NO |            770629 |                     | |         AG.CON.FERT.PT.ZS |                     |     122.73027213719 | |           AG.CON.FERT.ZS |                     |   373.087159048868 | |           AG.LND.AGRI.K2 |             5159980 |             5231970 | … How it works… The $ function is just a wrapper over Incanter's sel function. It provides a good way to slice columns out of the dataset, so we can focus only on the data that actually pertains to our analysis. There's more… The indicator codes for this dataset are a little cryptic. However, the code descriptions are in the dataset too: user=> (i/$ [0 1 2] [:Indicator-Code :Indicator-Name] chn-data)   |   :Indicator-Code |                                               :Indicator-Name | |-------------------+---------------------------------------------------------------| |   AG.AGR.TRAC.NO |                             Agricultural machinery, tractors | | AG.CON.FERT.PT.ZS |           Fertilizer consumption (% of fertilizer production) | |   AG.CON.FERT.ZS | Fertilizer consumption (kilograms per hectare of arable land) | … See also… For information on how to pull out specific rows, see the next recipe, Selecting rows with $. Selecting rows with $ The Incanter macro $ also pulls rows out of a dataset. In this recipe, we'll see this in action. 
Getting ready For this recipe, we'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Similar to how we use $ in order to select columns, there are several ways in which we can use it to select rows, shown as follows: We can create a sequence of the values of one row using $, and pass it the index of the row we want as well as passing :all for the columns: user=> (i/$ 0 :all chn-data) ("AG.AGR.TRAC.NO" "684290" "738526" "52661" "" "880859" "" "" "" "59657" "847916" "862078" "891170" "235524" "126440" "469106" "282282" "817857" "125442" "703117" "CHN" "66290" "705723" "824113" "" "151281" "669675" "861364" "559638" "191220" "180772" "73021" "858031" "734325" "Agricultural machinery, tractors" "100432" "" "796867" "" "China" "" "" "155602" "" "" "770629" "747900" "346786" "" "398946" "876470" "" "795713" "" "55360" "685202" "989139" "798506" "") We can also pull out a dataset containing multiple rows by passing more than one index into $ with a vector (There's a lot of data, even for three rows, so I won't show it here): (i/$ (range 3) :all chn-data) We can also combine the two ways to slice data in order to pull specific columns and rows. We can either pull out a single row or multiple rows: user=> (i/$ 0 [:Indicator-Code :1992] chn-data) ("AG.AGR.TRAC.NO" "770629") user=> (i/$ (range 3) [:Indicator-Code :1992] chn-data)   |   :Indicator-Code | :1992 | |-------------------+--------| |   AG.AGR.TRAC.NO | 770629 | | AG.CON.FERT.PT.ZS |       | |   AG.CON.FERT.ZS |       | How it works… The $ macro is the workhorse used to slice rows and project (or select) columns from datasets. When it's called with two indexing parameters, the first is the row or rows and the second is the column or columns. Filtering datasets with $where While we can filter datasets before we import them into Incanter, Incanter makes it easy to filter and create new datasets from the existing ones. We'll take a look at its query language in this recipe. Getting ready We'll use the same dependencies, imports, and data as we did in the Selecting columns with $ recipe. How to do it… Once we have the data, we query it using the $where function: For example, this creates a dataset with a row for the percentage of China's total land area that is used for agriculture: user=> (def land-use          (i/$where {:Indicator-Code "AG.LND.AGRI.ZS"}                    chn-data)) user=> (i/nrow land-use) 1 user=> (i/$ [:Indicator-Code :2000] land-use) ("AG.LND.AGRI.ZS" "56.2891584865366") The queries can be more complicated too. This expression picks out the data that exists for 1962 by filtering any empty strings in that column: user=> (i/$ (range 5) [:Indicator-Code :1962]          (i/$where {:1962 {:ne ""}} chn-data))   |   :Indicator-Code |             :1962 | |-------------------+-------------------| |   AG.AGR.TRAC.NO |             55360 | |   AG.LND.AGRI.K2 |           3460010 | |   AG.LND.AGRI.ZS | 37.0949187612906 | |   AG.LND.ARBL.HA |         103100000 | | AG.LND.ARBL.HA.PC | 0.154858284392508 | Incanter's query language is even more powerful than this, but these examples should show you the basic structure and give you an idea of the possibilities. How it works… To better understand how to use $where, let's break apart the last example: ($i/where {:1962 {:ne ""}} chn-data) The query is expressed as a hashmap from fields to values (highlighted). As we saw in the first example, the value can be a raw value, either a literal or an expression. This tests for inequality. 
($i/where {:1962 {:ne ""}} chn-data) Each test pair is associated with a field in another hashmap (highlighted). In this example, both the hashmaps shown only contain one key-value pair. However, they might contain multiple pairs, which will all be ANDed together. Incanter supports a number of test operators. The basic boolean tests are :$gt (greater than), :$lt (less than), :$gte (greater than or equal to), :$lte (less than or equal to), :$eq (equal to), and :$ne (not equal). There are also some operators that take sets as parameters: :$in and :$nin (not in). The last operator—:$fn—is interesting. It allows you to use any predicate function. For example, this will randomly select approximately half of the dataset: (def random-half (i/$where {:Indicator-Code {:$fn (fn [_] (< (rand) 0.5))}}            chnchn-data)) There's more… For full details of the query language, see the documentation for incanter.core/query-dataset (http://liebke.github.com/incanter/core-api.html#incanter.core/query-dataset). Grouping data with $group-by Datasets often come with an inherent structure. Two or more rows might have the same value in one column, and we might want to leverage that by grouping those rows together in our analysis. Getting ready First, we'll need to declare a dependency on Incanter in the project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                  [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) Next, we'll include Incanter core and io in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]) For data, we'll use the census race data for all the states. You can download it from http://www.ericrochester.com/clj-data-analysis/data/all_160.P3.csv. These lines will load the data into the race-data name: (def data-file "data/all_160.P3.csv") (def race-data (i-io/read-dataset data-file :header true)) How to do it… Incanter lets you group rows for further analysis or to summarize them with the $group-by function. All you need to do is pass the data to $group-by with the column or function to group on: (def by-state (i/$group-by :STATE race-data)) How it works… This function returns a map where each key is a map of the fields and values represented by that grouping. For example, this is how the keys look: user=> (take 5 (keys by-state)) ({:STATE 29} {:STATE 28} {:STATE 31} {:STATE 30} {:STATE 25}) We can get the data for Virginia back out by querying the group map for state 51. user=> (i/$ (range 3) [:GEOID :STATE :NAME :POP100]            (by-state {:STATE 51}))   | :GEOID | :STATE |         :NAME | :POP100 | |---------+--------+---------------+---------| | 5100148 |     51 | Abingdon town |   8191 | | 5100180 |     51 | Accomac town |     519 | | 5100724 |     51 | Alberta town |     298 | Saving datasets to CSV and JSON Once you've done the work of slicing, dicing, cleaning, and aggregating your datasets, you might want to save them. Incanter by itself doesn't have a good way to do this. However, with the help of some Clojure libraries, it's not difficult at all. 
Getting ready We'll need to include a number of dependencies in our project.clj file: (defproject inc-dsets "0.1.0":dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                 [org.clojure/data.csv "0.1.2"]                 [org.clojure/data.json "0.2.5"]]) We'll also need to include these libraries in our script or REPL: (require '[incanter.core :as i]          '[incanter.io :as i-io]          '[clojure.data.csv :as csv]          '[clojure.data.json :as json]          '[clojure.java.io :as io]) Also, we'll use the same data that we introduced in the Selecting columns with $ recipe. How to do it… This process is really as simple as getting the data and saving it. We'll pull out the data for the year 2000 from the larger dataset. We'll use this subset of the data in both the formats here: (def data2000 (i/$ [:Indicator-Code :Indicator-Name :2000] chn-data)) Saving data as CSV To save a dataset as a CSV, all in one statement, open a file and use clojure.data.csv/write-csv to write the column names and data to it: (with-open [f-out (io/writer "data/chn-2000.csv")] (csv/write-csv f-out [(map name (i/col-names data2000))]) (csv/write-csv f-out (i/to-list data2000))) Saving data as JSON To save a dataset as JSON, open a file and use clojure.data.json/write to serialize the file: (with-open [f-out (io/writer "data/chn-2000.json")] (json/write (:rows data2000) f-out)) How it works… For CSV and JSON, as well as many other data formats, the process is very similar. Get the data, open the file, and serialize data into it. There will be differences in how the output function wants the data (to-list or :rows), and there will be differences in how the output function is called (for instance, whether the file handle is the first or second argument). But generally, outputting datasets will be very similar and relatively simple. Projecting from multiple datasets with $join So far, we've been focusing on splitting up datasets, on dividing them into groups of rows or groups of columns with functions and macros such as $ or $where. However, sometimes we'd like to move in the other direction. We might have two related datasets and want to join them together to make a larger one. For example, we might want to join crime data to census data, or take any two related datasets that come from separate sources and analyze them together. Getting ready First, we'll need to include these dependencies in our project.clj file: (defproject inc-dsets "0.1.0" :dependencies [[org.clojure/clojure "1.6.0"]                 [incanter "1.5.5"]                  [org.clojure/data.csv "0.1.2"]]) We'll use these statements for inclusions: (require '[clojure.java.io :as io]          '[clojure.data.csv :as csv]          '[clojure.string :as str]          '[incanter.core :as i]) For our data file, we'll use the same data that we introduced in the Selecting columns with $ recipe: China's development dataset from the World Bank. How to do it… In this recipe, we'll take a look at how to join two datasets using Incanter: To begin with, we'll load the data from the data/chn/chn_Country_en_csv_v2.csv file. We'll use the with-header and read-country-data functions that were defined in the Selecting columns with $ recipe: (def data-file "data/chn/chn_Country_en_csv_v2.csv") (def chn-data (read-country-data data-file)) Currently, the data for each row contains the data for one indicator across many years. 
However, for some analyses, it will be more helpful to have each row contain the data for one indicator for one year. To do this, let's first pull out the data from 2 years into separate datasets. Note that for the second dataset, we'll only include a column to match the first dataset (:Indicator-Code) and the data column (:2000): (def chn-1990 (i/$ [:Indicator-Code :Indicator-Name :1990]        chn-data)) (def chn-2000 (i/$ [:Indicator-Code :2000] chn-data)) Now, we'll join these datasets back together. This is contrived, but it's easy to see how we will do this in a more meaningful example. For example, we might want to join the datasets from two different countries: (def chn-decade (i/$join [:Indicator-Code :Indicator-Code]            chn-1990 chn-2000)) From this point on, we can use chn-decade just as we use any other Incanter dataset. How it works… Let's take a look at this in more detail: (i/$join [:Indicator-Code :Indicator-Code] chn-1990 chn-2000) The pair of column keywords in a vector ([:Indicator-Code :Indicator-Code]) are the keys that the datasets will be joined on. In this case, the :Indicator-Code column from both the datasets is used, but the keys can be different for the two datasets. The first column that is listed will be from the first dataset (chn-1990), and the second column that is listed will be from the second dataset (chn-2000). This returns a new dataset. Each row of this new dataset is a superset of the corresponding rows from the two input datasets. Summary In this article we have covered covers the basics of working with Incanter datasets. Datasets are the core data structures used by Incanter, and understanding them is necessary in order to use Incanter effectively. Resources for Article: Further resources on this subject: The Hunt for Data [article] Limits of Game Data Analysis [article] Clojure for Domain-specific Languages - Design Concepts with Clojure [article]

Writing Consumers

Packt
04 Mar 2015
20 min read
This article by Nishant Garg, the author of the book Learning Apache Kafka Second Edition, focuses on the details of Writing Consumers. Consumers are the applications that consume the messages published by Kafka producers and process the data extracted from them. Like producers, consumers can also be different in nature, such as applications doing real-time or near real-time analysis, applications with NoSQL or data warehousing solutions, backend services, consumers for Hadoop, or other subscriber-based solutions. These consumers can also be implemented in different languages such as Java, C, and Python. (For more resources related to this topic, see here.) In this article, we will focus on the following topics: The Kafka Consumer API Java-based Kafka consumers Java-based Kafka consumers consuming partitioned messages At the end of the article, we will explore some of the important properties that can be set for a Kafka consumer. So, let's start. The preceding diagram explains the high-level working of the Kafka consumer when consuming the messages. The consumer subscribes to the message consumption from a specific topic on the Kafka broker. The consumer then issues a fetch request to the lead broker to consume the message partition by specifying the message offset (the beginning position of the message offset). Therefore, the Kafka consumer works in the pull model and always pulls all available messages after its current position in the Kafka log (the Kafka internal data representation). While subscribing, the consumer connects to any of the live nodes and requests metadata about the leaders for the partitions of a topic. This allows the consumer to communicate directly with the lead broker receiving the messages. Kafka topics are divided into a set of ordered partitions and each partition is consumed by one consumer only. Once a partition is consumed, the consumer changes the message offset to the next partition to be consumed. This represents the states about what has been consumed and also provides the flexibility of deliberately rewinding back to an old offset and re-consuming the partition. In the next few sections, we will discuss the API provided by Kafka for writing Java-based custom consumers. All the Kafka classes referred to in this article are actually written in Scala. Kafka consumer APIs Kafka provides two types of API for Java consumers: High-level API Low-level API The high-level consumer API The high-level consumer API is used when only data is needed and the handling of message offsets is not required. This API hides broker details from the consumer and allows effortless communication with the Kafka cluster by providing an abstraction over the low-level implementation. The high-level consumer stores the last offset (the position within the message partition where the consumer left off consuming the message), read from a specific partition in Zookeeper. This offset is stored based on the consumer group name provided to Kafka at the beginning of the process. The consumer group name is unique and global across the Kafka cluster and any new consumers with an in-use consumer group name may cause ambiguous behavior in the system. When a new process is started with the existing consumer group name, Kafka triggers a rebalance between the new and existing process threads for the consumer group. After the rebalance, some messages that are intended for a new process may go to an old process, causing unexpected results. 
To avoid this ambiguous behavior, any existing consumers should be shut down before starting new consumers for an existing consumer group name. The following are the classes that are imported to write Java-based basic consumers using the high-level consumer API for a Kafka cluster: ConsumerConnector: Kafka provides the ConsumerConnector interface (interface ConsumerConnector) that is further implemented by the ZookeeperConsumerConnector class (kafka.javaapi.consumer.ZookeeperConsumerConnector). This class is responsible for all the interaction a consumer has with ZooKeeper. The following is the class diagram for the ConsumerConnector class: KafkaStream: Objects of the kafka.consumer.KafkaStream class are returned by the createMessageStreams call from the ConsumerConnector implementation. This list of the KafkaStream objects is returned for each topic, which can further create an iterator over messages in the stream. The following is the Scala-based class declaration: class KafkaStream[K,V](private val queue:                       BlockingQueue[FetchedDataChunk],                       consumerTimeoutMs: Int,                       private val keyDecoder: Decoder[K],                       private val valueDecoder: Decoder[V],                       val clientId: String) Here, the parameters K and V specify the type for the partition key and message value, respectively. In the create call from the ConsumerConnector class, clients can specify the number of desired streams, where each stream object is used for single-threaded processing. These stream objects may represent the merging of multiple unique partitions. ConsumerConfig: The kafka.consumer.ConsumerConfig class encapsulates the property values required for establishing the connection with ZooKeeper, such as ZooKeeper URL, ZooKeeper session timeout, and ZooKeeper sink time. It also contains the property values required by the consumer such as group ID and so on. A high-level API-based working consumer example is discussed after the next section. The low-level consumer API The high-level API does not allow consumers to control interactions with brokers. Also known as "simple consumer API", the low-level consumer API is stateless and provides fine grained control over the communication between Kafka broker and the consumer. It allows consumers to set the message offset with every request raised to the broker and maintains the metadata at the consumer's end. This API can be used by both online as well as offline consumers such as Hadoop. These types of consumers can also perform multiple reads for the same message or manage transactions to ensure the message is consumed only once. Compared to the high-level consumer API, developers need to put in extra effort to gain low-level control within consumers by keeping track of offsets, figuring out the lead broker for the topic and partition, handling lead broker changes, and so on. In the low-level consumer API, consumers first query the live broker to find out the details about the lead broker. Information about the live broker can be passed on to the consumers either using a properties file or from the command line. The topicsMetadata() method of the kafka.javaapi.TopicMetadataResponse class is used to find out metadata about the topic of interest from the lead broker. For message partition reading, the kafka.api.OffsetRequest class defines two constants: EarliestTime and LatestTime, to find the beginning of the data in the logs and the new messages stream. 
These constants also help consumers to track which messages are already read. The main class used within the low-level consumer API is the SimpleConsumer (kafka.javaapi.consumer.SimpleConsumer) class. The following is the class diagram for the SimpleConsumer class:   A simple consumer class provides a connection to the lead broker for fetching the messages from the topic and methods to get the topic metadata and the list of offsets. A few more important classes for building different request objects are FetchRequest (kafka.api.FetchRequest), OffsetRequest (kafka.javaapi.OffsetRequest), OffsetFetchRequest (kafka.javaapi.OffsetFetchRequest), OffsetCommitRequest (kafka.javaapi.OffsetCommitRequest), and TopicMetadataRequest (kafka.javaapi.TopicMetadataRequest). All the examples in this article are based on the high-level consumer API. For examples based on the low-level consumer API, refer tohttps://cwiki.apache.org/confluence/display/KAFKA/0.8.0+SimpleConsumer+Example. Simple Java consumers Now we will start writing a single-threaded simple Java consumer developed using the high-level consumer API for consuming the messages from a topic. This SimpleHLConsumer class is used to fetch a message from a specific topic and consume it, assuming that there is a single partition within the topic. Importing classes As a first step, we need to import the following classes: import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; Defining properties As a next step, we need to define properties for making a connection with Zookeeper and pass these properties to the Kafka consumer using the following code: Properties props = new Properties(); props.put("zookeeper.connect", "localhost:2181"); props.put("group.id", "testgroup"); props.put("zookeeper.session.timeout.ms", "500"); props.put("zookeeper.sync.time.ms", "250"); props.put("auto.commit.interval.ms", "1000"); new ConsumerConfig(props); Now let us see the major properties mentioned in the code: zookeeper.connect: This property specifies the ZooKeeper <node:port> connection detail that is used to find the Zookeeper running instance in the cluster. In the Kafka cluster, Zookeeper is used to store offsets of messages consumed for a specific topic and partition by this consumer group. group.id: This property specifies the name for the consumer group shared by all the consumers within the group. This is also the process name used by Zookeeper to store offsets. zookeeper.session.timeout.ms: This property specifies the Zookeeper session timeout in milliseconds and represents the amount of time Kafka will wait for Zookeeper to respond to a request before giving up and continuing to consume messages. zookeeper.sync.time.ms: This property specifies the ZooKeeper sync time in milliseconds between the ZooKeeper leader and the followers. auto.commit.interval.ms: This property defines the frequency in milliseconds at which consumer offsets get committed to Zookeeper. Reading messages from a topic and printing them As a final step, we need to read the message using the following code: Map<String, Integer> topicMap = new HashMap<String, Integer>(); // 1 represents the single thread topicCount.put(topic, new Integer(1));   Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap = consumer.createMessageStreams(topicMap);   // Get the list of message streams for each topic, using the default decoder. 
List<KafkaStream<byte[], byte[]>>streamList =  consumerStreamsMap.get(topic);   for (final KafkaStream <byte[], byte[]> stream : streamList) { ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();   while (consumerIte.hasNext())     System.out.println("Message from Single Topic :: "     + new String(consumerIte.next().message())); } So the complete program will look like the following code: package kafka.examples.ch5;   import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Properties;   import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector;   public class SimpleHLConsumer {   private final ConsumerConnector consumer;   private final String topic;     public SimpleHLConsumer(String zookeeper, String groupId, String topic) {     consumer = kafka.consumer.Consumer         .createJavaConsumerConnector(createConsumerConfig(zookeeper,             groupId));     this.topic = topic;   }     private static ConsumerConfig createConsumerConfig(String zookeeper,         String groupId) {     Properties props = new Properties();     props.put("zookeeper.connect", zookeeper);     props.put("group.id", groupId);     props.put("zookeeper.session.timeout.ms", "500");     props.put("zookeeper.sync.time.ms", "250");     props.put("auto.commit.interval.ms", "1000");       return new ConsumerConfig(props);     }     public void testConsumer() {       Map<String, Integer> topicMap = new HashMap<String, Integer>();       // Define single thread for topic     topicMap.put(topic, new Integer(1));       Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =         consumer.createMessageStreams(topicMap);       List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap         .get(topic);       for (final KafkaStream<byte[], byte[]> stream : streamList) {       ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();       while (consumerIte.hasNext())         System.out.println("Message from Single Topic :: "           + new String(consumerIte.next().message()));     }     if (consumer != null)       consumer.shutdown();   }     public static void main(String[] args) {       String zooKeeper = args[0];     String groupId = args[1];     String topic = args[2];     SimpleHLConsumer simpleHLConsumer = new SimpleHLConsumer(           zooKeeper, groupId, topic);     simpleHLConsumer.testConsumer();   }   } Before running this, make sure you have created the topic kafkatopic from the command line: [root@localhost kafka_2.9.2-0.8.1.1]#bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic kafkatopic Before compiling and running a Java-based Kafka program in the console, make sure you download the slf4j-1.7.7.tar.gz file from http://www.slf4j.org/download.html and copy slf4j-log4j12-1.7.7.jar contained within slf4j-1.7.7.tar.gz to the /opt/kafka_2.9.2-0.8.1.1/libs directory. 
Also add all the libraries available in /opt/kafka_2.9.2-0.8.1.1/libs to the classpath using the following commands: [root@localhost kafka_2.9.2-0.8.1.1]# export KAFKA_LIB=/opt/kafka_2.9.2-0.8.1.1/libs [root@localhost kafka_2.9.2-0.8.1.1]# export CLASSPATH=.:$KAFKA_LIB/jopt-simple-3.2.jar:$KAFKA_LIB/kafka_2.9.2-0.8.1.1.jar:$KAFKA_LIB/log4j-1.2.15.jar:$KAFKA_LIB/metrics-core-2.2.0.jar:$KAFKA_LIB/scala-library-2.9.2.jar:$KAFKA_LIB/slf4j-api-1.7.2.jar:$KAFKA_LIB/slf4j-log4j12-1.7.7.jar:$KAFKA_LIB/snappy-java-1.0.5.jar:$KAFKA_LIB/zkclient-0.3.jar:$KAFKA_LIB/zookeeper-3.3.4.jar Multithreaded Java consumers The previous example is a very basic example of a consumer that consumes messages from a single broker with no explicit partitioning of messages within the topic. Let's jump to the next level and write another program that consumes messages from multiple partitions connecting to single/multiple topics. A multithreaded, high-level, consumer-API-based design is usually based on the number of partitions in the topic and follows a one-to-one mapping approach between the thread and the partitions within the topic. For example, if four partitions are defined for any topic, as a best practice, only four threads should be initiated with the consumer application to read the data; otherwise, some conflicting behavior, such as threads never receiving a message or a thread receiving messages from multiple partitions, may occur. Also, receiving multiple messages will not guarantee that the messages will be placed in order. For example, a thread may receive two messages from the first partition and three from the second partition, then three more from the first partition, followed by some more from the first partition, even if the second partition has data available. Let's move further on. Importing classes As a first step, we need to import the following classes: import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector; Defining properties As the next step, we need to define properties for making a connection with Zookeeper and pass these properties to the Kafka consumer using the following code: Properties props = new Properties(); props.put("zookeeper.connect", "localhost:2181"); props.put("group.id", "testgroup"); props.put("zookeeper.session.timeout.ms", "500"); props.put("zookeeper.sync.time.ms", "250"); props.put("auto.commit.interval.ms", "1000"); new ConsumerConfig(props); The preceding properties have already been discussed in the previous example. For more details on Kafka consumer properties, refer to the last section of this article. 
Reading the message from threads and printing it The only difference in this section from the previous section is that we first create a thread pool and get the Kafka streams associated with each thread within the thread pool, as shown in the following code: // Define thread count for each topic topicMap.put(topic, new Integer(threadCount));   // Here we have used a single topic but we can also add // multiple topics to topicCount MAP Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap            = consumer.createMessageStreams(topicMap);   List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap.get(topic);   // Launching the thread pool executor = Executors.newFixedThreadPool(threadCount); The complete program listing for the multithread Kafka consumer based on the Kafka high-level consumer API is as follows: package kafka.examples.ch5;   import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.Properties; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors;   import kafka.consumer.ConsumerConfig; import kafka.consumer.ConsumerIterator; import kafka.consumer.KafkaStream; import kafka.javaapi.consumer.ConsumerConnector;   public class MultiThreadHLConsumer {     private ExecutorService executor;   private final ConsumerConnector consumer;   private final String topic;     public MultiThreadHLConsumer(String zookeeper, String groupId, String topic) {     consumer = kafka.consumer.Consumer         .createJavaConsumerConnector(createConsumerConfig(zookeeper, groupId));     this.topic = topic;   }     private static ConsumerConfig createConsumerConfig(String zookeeper,         String groupId) {     Properties props = new Properties();     props.put("zookeeper.connect", zookeeper);     props.put("group.id", groupId);     props.put("zookeeper.session.timeout.ms", "500");     props.put("zookeeper.sync.time.ms", "250");     props.put("auto.commit.interval.ms", "1000");       return new ConsumerConfig(props);     }     public void shutdown() {     if (consumer != null)       consumer.shutdown();     if (executor != null)       executor.shutdown();   }     public void testMultiThreadConsumer(int threadCount) {       Map<String, Integer> topicMap = new HashMap<String, Integer>();       // Define thread count for each topic     topicMap.put(topic, new Integer(threadCount));       // Here we have used a single topic but we can also add     // multiple topics to topicCount MAP     Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreamsMap =         consumer.createMessageStreams(topicMap);       List<KafkaStream<byte[], byte[]>> streamList = consumerStreamsMap         .get(topic);       // Launching the thread pool     executor = Executors.newFixedThreadPool(threadCount);       // Creating an object messages consumption     int count = 0;     for (final KafkaStream<byte[], byte[]> stream : streamList) {       final int threadNumber = count;       executor.submit(new Runnable() {       public void run() {       ConsumerIterator<byte[], byte[]> consumerIte = stream.iterator();       while (consumerIte.hasNext())         System.out.println("Thread Number " + threadNumber + ": "         + new String(consumerIte.next().message()));         System.out.println("Shutting down Thread Number: " +         threadNumber);         }       });       count++;     }     if (consumer != null)       consumer.shutdown();     if (executor != null)       executor.shutdown();   }     public static void main(String[] args) {       
String zooKeeper = args[0];     String groupId = args[1];     String topic = args[2];     int threadCount = Integer.parseInt(args[3]);     MultiThreadHLConsumer multiThreadHLConsumer =         new MultiThreadHLConsumer(zooKeeper, groupId, topic);     multiThreadHLConsumer.testMultiThreadConsumer(threadCount);     try {       Thread.sleep(10000);     } catch (InterruptedException ie) {       }     multiThreadHLConsumer.shutdown();     } } Compile the preceding program, and before running it, read the following tip. Before we run this program, we need to make sure our cluster is running as a multi-broker cluster (comprising either single or multiple nodes).  Once your multi-broker cluster is up, create a topic with four partitions and set the replication factor to 2 before running this program using the following command: [root@localhost kafka-0.8]# bin/kafka-topics.sh --zookeeper localhost:2181 --create --topic kafkatopic --partitions 4 --replication-factor 2 The Kafka consumer property list The following lists of a few important properties that can be configured for high-level, consumer-API-based Kafka consumers. The Scala class kafka.consumer.ConsumerConfig provides implementation-level details for consumer configurations. For a complete list, visit http://kafka.apache.org/documentation.html#consumerconfigs. Property name Description Default value group.id This property defines a unique identity for the set of consumers within the same consumer group.   consumer.id This property is specified for the Kafka consumer and generated automatically if not defined. null zookeeper.connect This property specifies the Zookeeper connection string, < hostname:port/chroot/path>. Kafka uses Zookeeper to store offsets of messages consumed for a specific topic and partition by the consumer group. /chroot/path defines the data location in a global zookeeper namespace.   client.id The client.id value is specified by the Kafka client with each request and is used to identify the client making the requests. ${group.id} zookeeper.session.timeout.ms This property defines the time (in milliseconds) for a Kafka consumer to wait for a Zookeeper pulse before it is declared dead and rebalance is initiated. 6000 zookeeper.connection.timeout.ms This value defines the maximum waiting time (in milliseconds) for the client to establish a connection with ZooKeeper. 6000 zookeeper.sync.time.ms This property defines the time it takes to sync a Zookeeper follower with the Zookeeper leader (in milliseconds). 2000 auto.commit.enable This property enables a periodical commit of message offsets to the Zookeeper that are already fetched by the consumer. In the event of consumer failures, these committed offsets are used as a starting position by the new consumers. true auto.commit.interval.ms This property defines the frequency (in milliseconds) for the consumed offsets to get committed to ZooKeeper. 60 * 1000 auto.offset.reset This property defines the offset value if an initial offset is available in Zookeeper or the offset is out of range. Possible values are: largest: reset to largest offset smallest: reset to smallest offset anything else: throw an exception largest consumer.timeout.ms This property throws an exception to the consumer if no message is available for consumption after the specified interval. -1 Summary In this article, we have learned how to write basic consumers and learned about some advanced levels of Java consumers that consume messages from partitions. 
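The examples in this article are Java-based, but as mentioned at the start, Kafka consumers can also be written in Python. For comparison only, the following is a minimal sketch of a similar single-threaded consumer built with the third-party kafka-python client. It is not part of the book's code: it assumes kafka-python is installed (pip install kafka-python), that a broker is reachable on localhost:9092, and that the topic and consumer group match the earlier Java example. Note that this client connects to the brokers directly rather than to ZooKeeper, so it needs a reasonably recent broker version.

# Minimal consumer sketch using kafka-python (assumed installed); not from the book.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'kafkatopic',                         # topic created earlier with kafka-topics.sh
    bootstrap_servers='localhost:9092',   # broker address (assumption)
    group_id='testgroup',                 # consumer group, as in the Java example
    auto_offset_reset='earliest',         # start from the beginning if no committed offset
    enable_auto_commit=True)              # periodic offset commits, like auto.commit.enable

for message in consumer:
    print('Message from Single Topic :: %s' % message.value.decode('utf-8'))

The configuration names map closely onto the properties listed in the table above (group.id, auto.offset.reset, and so on), which makes it straightforward to translate a setup between the two clients.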
Resources for Article: Further resources on this subject: Introducing Kafka? [article] Introduction To Apache Zookeeper [article] Creating Apache Jmeter™ Test Workbench [article]

How to maintain Apache Mesos

Vijin Boricha
13 Feb 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by David Blomquist and Tomasz Janiszewski, titled Apache Mesos Cookbook. Throughout the course of the book, you will get to know tips and tricks along with best practices to follow when working with Mesos.[/box] In this article, we will learn about configuring logging options, setting up monitoring ecosystem, and upgrading your Mesos cluster. Logging and debugging Here we will configure logging options that will allow us to debug the state of Mesos. Getting ready We will assume Mesos is available on localhost port 5050. The steps provided here will work for either master or agents. How to do it... When Mesos is installed from pre-built packages, the logs are by default stored in /var/log/mesos/. When installing from a source build, storing logs is disabled by default. To change the log store location, we need to edit /etc/default/mesos and set the LOGS variable to the desired destination. For some reason, mesos-init-wrapper does not transfer the contents of /etc/mesos/log_dir to the --log_dir flag. That's why we need to set the log's destination in the environment variable. Remember that only Mesos logs will be stored there. Logs from third-party applications (for example, ZooKeeper) will still be sent to STDERR. Changing the default logging level can be done in one of two ways: by specifying the -- logging_level flag or by sending a request and changing the logging level at runtime for a specific period of time. For example, to change the logging level to INFO, just put it in the following code: /etc/mesos/logging_level echo INFO > /etc/mesos/logging_level The possible levels are INFO, WARNING, and ERROR. For example, to change the logging level to the most verbose for 15 minutes for debug purposes, we need to send the following request to the logging/toggle endpoint: curl -v -X POST localhost:5050/logging/toggle?level=3&duration=15mins How it works... Mesos uses the Google-glog library for debugging, but third-party dependencies such as ZooKeeper have their own logging solution. All configuration options are backed by glog and apply only to Mesos core code. Monitoring Now, we will set up monitoring for Mesos. Getting ready We must have a running monitoring ecosystem. Metrics storage could be a simple time- series database such as graphite, influxdb, or prometheus. In the following example, we are using graphite and our metrics are published with http://diamond.readthedocs.io/en/latest/. How to do it... Monitoring is enabled by default. Mesos does not provide any way to automatically push metrics to the registry. However, it exposes them as a JSON that can be periodically pulled and saved into the metrics registry:  Install Diamond using following command: pip install diamond  If additional packages are required to install them, run: sudo apt-get install python-pip python-dev build-essential. pip (Pip Installs Packages) is a Python package manager used to install software written in Python. Configure the metrics handler and interval. Open /etc/diamond/diamond.conf and ensure that there is a section for graphite configuration: [handler_graphite] class = handlers.GraphiteHandler host = <graphite.host> port = <graphite.port> Remember to replace graphite.host and graphite.port with real graphite details. Enable the default Mesos Collector. Create configuration files diamond-setup  - C MesosCollector. Check whether the configuration has proper values and edit them if needed. 
The configuration can be found in /etc/diamond/collectors/MesosCollector.conf. On master, this file should look like this: enabled = True host = localhost port = 5050 While on agent, the port could be different (5051), as follows: enabled = True host = localhost port = 5051 How it works... Mesos exposes metrics via the HTTP API. Diamond is a small process that periodically pulls metrics, parses them, and sends them to the metrics registry, in this case, graphite. The default implementation of Mesos Collector does not store all the available metrics so it's recommended to write a custom handler that will collect all the interesting information. See also... Metrics could be read from the following endpoints: http://mesos.apache.org/documentation/latest/endpoints/metrics/snapshot/ http://mesos.apache.org/documentation/latest/endpoints/slave/monitor/statistics/  http://mesos.apache.org/documentation/latest/endpoints/slave/state/ Upgrading Mesos In this recipe, you will learn how to upgrade your Mesos cluster. How to do it... Mesos release cadence is at least one release per quarter. Minor releases are backward compatible, although there could be some small incompatibilities or the dropping of deprecated methods. The recommended method of upgrading is to apply all intermediate versions. For example, to upgrade from 0.27.2 to 1.0.0, we should apply 0.28.0, 0.28.1, 0.28.2, and finally 1.0.0. If the agent's configuration changes, clearing the metadata directory is required. You can do this with the following code: rm -rv {MESOS_DIR}/metadata Here, {MESOS_DIR} should be replaced with the configured Mesos directory. Rolling upgrades is the preferred method of upgrading clusters, starting with masters and then agents. To minimize the impact on running tasks, if an agent's configuration changes and it becomes inaccessible, then it should be switched to maintenance mode. How it works... Configuration changes may require clearing the metadata because the changes may not be backward compatible. For example, when an agent runs with different isolators, it shouldn't attach to the already running processes without this isolator. The Mesos architecture will guarantee that the executors that were not attached to the Mesos agent will commit suicide after a configurable amount of time (--executor_registration_timeout). Maintenance mode allows you to declare the time window during which the agent will be inaccessible. When this occurs, Mesos will send a reverse offer to all the frameworks to drain that particular agent. The frameworks are responsible for shutting down its task and spawning it on another agent. The Maintenance mode is applied, even if the framework does not implement the HTTP API or is explicitly declined. Using maintenance mode can prevent restarting tasks multiple times. Consider the following example with five agents and one task, X. We schedule the rolling upgrade of all the agents. Task X is deployed on agent 1. When it goes down, it's moved to 2, then to 3, and so on. This approach is extremely inefficient because the task is restarted five times, but it only needs to be restarted twice. Maintenance mode enables the framework to optimally schedule the task to run on agent 5 when 1 goes down, and then return to 1 when 5 goes down: Worst case scenario of rolling upgrade without maintenance mode legend optimal solution of rolling upgrade with maintenance mode. We have learnt about running and maintaining Mesos. 
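As a small footnote to the Monitoring recipe above, the metrics endpoint that Diamond polls can also be inspected by hand, which is handy when debugging a collector configuration. The following Python sketch is only an illustration and is not part of the book: it assumes a master running on localhost:5050 and uses just the standard library; the metric names shown are typical master metrics, so adjust them to whatever your snapshot actually contains.

# Fetch and print a few values from the Mesos metrics snapshot (illustrative sketch).
import json
from urllib.request import urlopen  # Python 3 standard library

with urlopen('http://localhost:5050/metrics/snapshot') as response:
    snapshot = json.loads(response.read().decode('utf-8'))

for key in ('master/uptime_secs', 'master/tasks_running', 'master/mem_percent'):
    # Adjust these keys to the metrics present in your own snapshot.
    print(key, snapshot.get(key))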
To learn more about managing containers and understanding the scheduler API, you may check out this book, Apache Mesos Cookbook.

Ten IPython essentials

Packt
02 May 2013
10 min read
(For more resources related to this topic, see here.) Running the IPython console If IPython has been installed correctly, you should be able to run it from a system shell with the ipython command. You can use this prompt like a regular Python interpreter as shown in the following screenshot: Command-line shell on Windows If you are on Windows and using the old cmd.exe shell, you should be aware that this tool is extremely limited. You could instead use a more powerful interpreter, such as Microsoft PowerShell, which is integrated by default in Windows 7 and 8. The simple fact that most common filesystem-related commands (namely, pwd, cd, ls, cp, ps, and so on) have the same name as in Unix should be a sufficient reason to switch. Of course, IPython offers much more than that. For example, IPython ships with tens of little commands that considerably improve productivity. Some of these commands help you get information about any Python function or object. For instance, have you ever had a doubt about how to use the super function to access parent methods in a derived class? Just type super? (a shortcut for the command %pinfo super) and you will find all the information regarding the super function. Appending ? or ?? to any command or variable gives you all the information you need about it, as shown here: In [1]: super? Typical use to call a cooperative superclass method: class C(B): def meth(self, arg): super(C, self).meth(arg) Using IPython as a system shell You can use the IPython command-line interface as an extended system shell. You can navigate throughout your filesystem and execute any system command. For instance, the standard Unix commands pwd, ls, and cd are available in IPython and work on Windows too, as shown in the following example: In [1]: pwd Out[1]: u'C:' In [2]: cd windows C:windows These commands are particular magic commands that are central in the IPython shell. There are dozens of magic commands and we will use a lot of them throughout this book. You can get a list of all magic commands with the %lsmagic command. Using the IPython magic commands Magic commands actually come with a % prefix, but the automagic system, enabled by default, allows you to conveniently omit this prefix. Using the prefix is always possible, particularly when the unprefixed command is shadowed by a Python variable with the same name. The %automagic command toggles the automagic system. In this book, we will generally use the % prefix to refer to magic commands, but keep in mind that you can omit it most of the time, if you prefer. Using the history Like the standard Python console, IPython offers a command history. However, unlike in Python's console, the IPython history spans your previous interactive sessions. In addition to this, several key strokes and commands allow you to reduce repetitive typing. In an IPython console prompt, use the up and down arrow keys to go through your whole input history. If you start typing before pressing the arrow keys, only the commands that match what you have typed so far will be shown. In any interactive session, your input and output history is kept in the In and Out variables and is indexed by a prompt number. The _, __, ___ and _i, _ii, _iii variables contain the last three output and input objects, respectively. The _n and _in variables return the nth output and input history. For instance, let's type the following command: In [4]: a = 12 In [5]: a ** 2 Out[5]: 144 In [6]: print("The result is {0:d}.".format(_)) The result is 144. 
In this example, we display the output, that is, 144 of prompt 5 on line 6. Tab completion Tab completion is incredibly useful and you will find yourself using it all the time. Whenever you start typing any command, variable name, or function, press the Tab key to let IPython either automatically complete what you are typing if there is no ambiguity, or show you the list of possible commands or names that match what you have typed so far. It also works for directories and file paths, just like in the system shell. It is also particularly useful for dynamic object introspection. Type any Python object name followed by a point and then press the Tab key; IPython will show you the list of existing attributes and methods, as shown in the following example: In [1]: import os In [2]: os.path.split<tab> os.path.split os.path.splitdrive os.path.splitext os.path.splitunc In the second line, as shown in the previous code, we press the Tab key after having typed os.path.split. IPython then displays all the possible commands. Tab Completion and Private Variables Tab completion shows you all the attributes and methods of an object, except those that begin with an underscore (_). The reason is that it is a standard convention in Python programming to prefix private variables with an underscore. To force IPython to show all private attributes and methods, type myobject._ before pressing the Tab key. Nothing is really private or hidden in Python. It is part of a general Python philosophy, as expressed by the famous saying, "We are all consenting adults here." Executing a script with the %run command Although essential, the interactive console becomes limited when running sequences of multiple commands. Writing multiple commands in a Python script with the .py file extension (by convention) is quite common. A Python script can be executed from within the IPython console with the %run magic command followed by the script filename. The script is executed in a fresh, new Python namespace unless the -i option has been used, in which case the current interactive Python namespace is used for the execution. In all cases, all variables defined in the script become available in the console at the end of script execution. Let's write the following Python script in a file called script.py: print("Running script.") x = 12 print("'x' is now equal to {0:d}.".format(x)) Now, assuming we are in the directory where this file is located, we can execute it in IPython by entering the following command: In [1]: %run script.py Running script. 'x' is now equal to 12. In [2]: x Out[2]: 12 When running the script, the standard output of the console displays any print statement. At the end of execution, the x variable defined in the script is then included in the interactive namespace, which is quite convenient. Quick benchmarking with the %timeit command You can do quick benchmarks in an interactive session with the %timeit magic command. It lets you estimate how much time the execution of a single command takes. The same command is executed multiple times within a loop, and this loop itself is repeated several times by default. The individual execution time of the command is then automatically estimated with an average. The -n option controls the number of executions in a loop, whereas the -r option controls the number of executed loops. 
For example, let's type the following command: In[1]: %timeit [x*x for x in range(100000)] 10 loops, best of 3: 26.1 ms per loop Here, it took about 26 milliseconds to compute the squares of all integers up to 100000. Quick debugging with the %debug command IPython ships with a powerful command-line debugger. Whenever an exception is raised in the console, use the %debug magic command to launch the debugger at the exception point. You then have access to all the local variables and to the full stack traceback in postmortem mode. Navigate up and down through the stack with the u and d commands and exit the debugger with the q command. See the list of all the available commands in the debugger by entering the ? command. You can use the %pdb magic command to activate the automatic execution of the IPython debugger as soon as an exception is raised. Interactive computing with Pylab The %pylab magic command enables the scientific computing capabilities of the NumPy and matplotlib packages, namely efficient operations on vectors and matrices and plotting and interactive visualization features. It becomes possible to perform interactive computations in the console and plot graphs dynamically. For example, let's enter the following command: In [1]: %pylab Welcome to pylab, a matplotlib-based Python environment [backend: TkAgg]. For more information, type 'help(pylab)'. In [2]: x = linspace(-10., 10., 1000) In [3]: plot(x, sin(x)) In this example, we first define a vector of 1000 values linearly spaced between -10 and 10. Then we plot the graph (x, sin(x)). A window with a plot appears as shown in the following screenshot, and the console is not blocked while this window is opened. This allows us to interactively modify the plot while it is open. Using the IPython Notebook The Notebook brings the functionality of IPython into the browser for multiline textediting features, interactive session reproducibility, and so on. It is a modern and powerful way of using Python in an interactive and reproducible way To use the Notebook, call the ipython notebook command in a shell (make sure you have installed the required dependencies). This will launch a local web server on the default port 8888. Go to http://127.0.0.1:8888/ in a browser and create a new Notebook. You can write one or several lines of code in the input cells. Here are some of the most useful keyboard shortcuts: Press the Enter key to create a new line in the cell and not execute the cell Press Shift + Enter to execute the cell and go to the next cell Press Alt + Enter to execute the cell and append a new empty cell right after it Press Ctrl + Enter for quick instant experiments when you do not want to save the output Press Ctrl + M and then the H key to display the list of all the keyboard shortcuts Customizing IPython You can save your user preferences in a Python file; this file is called an IPython profile. To create a default profile, type ipython profile create in a shell. This will create a folder named profile_default in the ~/.ipython or ~/.config/ ipython directory. The file ipython_config.py in this folder contains preferences about IPython. You can create different profiles with different names using ipython profile create profilename, and then launch IPython with ipython --profile=profilename to use that profile. The ~ directory is your home directory, for example, something like /home/ yourname on Unix, or C:Usersyourname or C:Documents and Settings yourname on Windows. 
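To make that last point concrete, here is a small sketch of what profile_default/ipython_config.py might contain. The option names shown are common ones, but they change between IPython versions, so treat them as assumptions and check the comments in the generated config file (or ipython --help-all) for the exact spelling in your installation.

# Sketch of ~/.ipython/profile_default/ipython_config.py; option names may vary by version.
c = get_config()                              # provided by IPython when the file is loaded

c.TerminalIPythonApp.display_banner = False   # skip the startup banner
c.InteractiveShell.automagic = True           # allow magic commands without the % prefix
c.InteractiveShell.colors = 'Linux'           # color scheme used in the console
c.InteractiveShellApp.exec_lines = [          # lines executed at every startup
    'import numpy as np',
]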
Summary We have gone through 10 of the most interesting features offered by IPython in this article. They essentially concern the Python and shell interactive features, including the integrated debugger and profiler, and the interactive computing and visualization features brought by the NumPy and Matplotlib packages. Resources for Article : Further resources on this subject: Advanced Matplotlib: Part 1 [Article] Python Testing: Installing the Robot Framework [Article] Running a simple game using Pygame [Article]

Stream Grouping

Packt
26 Aug 2014
7 min read
In this article, by Ankit Jain and Anand Nalya, the authors of the book Learning Storm, we will cover different types of stream groupings. (For more resources related to this topic, see here.) When defining a topology, we create a graph of computation with a number of bolt-processing streams. At a more granular level, each bolt executes as multiple tasks in the topology. A stream will be partitioned into a number of partitions and divided among the bolts' tasks. Thus, each task of a particular bolt will only get a subset of the tuples from the subscribed streams. Stream grouping in Storm provides complete control over how this partitioning of tuples happens among many tasks of a bolt subscribed to a stream. Grouping for a bolt can be defined on the instance of the backtype.storm.topology.InputDeclarer class returned when defining bolts using the backtype.storm.topology.TopologyBuilder.setBolt method. Storm supports the following types of stream groupings: Shuffle grouping Fields grouping All grouping Global grouping Direct grouping Local or shuffle grouping Custom grouping Now, we will look at each of these groupings in detail. Shuffle grouping Shuffle grouping distributes tuples in a uniform, random way across the tasks. An equal number of tuples will be processed by each task. This grouping is ideal when you want to distribute your processing load uniformly across the tasks and where there is no requirement of any data-driven partitioning. Fields grouping Fields grouping enables you to partition a stream on the basis of some of the fields in the tuples. For example, if you want that all the tweets from a particular user should go to a single task, then you can partition the tweet stream using fields grouping on the username field in the following manner: builder.setSpout("1", new TweetSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")) Fields grouping is calculated with the following function: hash (fields) % (no. of tasks) Here, hash is a hashing function. It does not guarantee that each task will get tuples to process. For example, if you have applied fields grouping on a field, say X, with only two possible values, A and B, and created two tasks for the bolt, then it might be possible that both hash (A) % 2 and hash (B) % 2 are equal, which will result in all the tuples being routed to a single task and other tasks being completely idle. Another common usage of fields grouping is to join streams. Since partitioning happens solely on the basis of field values and not the stream type, we can join two streams with any common join fields. The name of the fields do not need to be the same. For example, in order to process domains, we can join the Order and ItemScanned streams when an order is completed: builder.setSpout("1", new OrderSpout()); builder.setSpout("2", new ItemScannedSpout()); builder.setBolt("joiner", new OrderJoiner()) .fieldsGrouping("1", new Fields("orderId")) .fieldsGrouping("2", new Fields("orderRefId")); All grouping All grouping is a special grouping that does not partition the tuples but replicates them to all the tasks, that is, each tuple will be sent to each of the bolt's tasks for processing. One common use case of all grouping is for sending signals to bolts. For example, if you are doing some kind of filtering on the streams, then you have to pass the filter parameters to all the bolts. This can be achieved by sending those parameters over a stream that is subscribed by all bolts' tasks with all grouping. 
Another example is to send a reset message to all the tasks in an aggregation bolt. The following is an example of all grouping: builder.setSpout("1", new TweetSpout()); builder.setSpout("signals", new SignalSpout()); builder.setBolt("2", new TweetCounter()).fieldsGrouping("1", new Fields("username")).allGrouping("signals"); Here, we are subscribing signals for all the TweetCounter bolt's tasks. Now, we can send different signals to the TweetCounter bolt using SignalSpout. Global grouping Global grouping does not partition the stream but sends the complete stream to the bolt's task with the smallest ID. A general use case of this is when there needs to be a reduce phase in your topology where you want to combine results from previous steps in the topology in a single bolt. Global grouping might seem redundant at first, as you can achieve the same results with defining the parallelism for the bolt as one and setting the number of input streams to one. Though, when you have multiple streams of data coming through different paths, you might want only one of the streams to be reduced and others to be processed in parallel. For example, consider the following topology. In this topology, you might want to route all the tuples coming from Bolt C to a single Bolt D task, while you might still want parallelism for tuples coming from Bolt E to Bolt D. Global grouping This can be achieved with the following code snippet: builder.setSpout("a", new SpoutA()); builder.setSpout("b", new SpoutB()); builder.setBolt("c", new BoltC()); builder.setBolt("e", new BoltE()); builder.setBolt("d", new BoltD()) .globalGrouping("c") .shuffleGrouping("e"); Direct grouping In direct grouping, the emitter decides where each tuple will go for processing. For example, say we have a log stream and we want to process each log entry using a specific bolt task on the basis of the type of resource. In this case, we can use direct grouping. Direct grouping can only be used with direct streams. To declare a stream as a direct stream, use the backtype.storm.topology.OutputFieldsDeclarer.declareStream method that takes a Boolean parameter directly in the following way in your spout: @Override public void declareOutputFields(OutputFieldsDeclarer declarer) { declarer.declareStream("directStream", true, new Fields("field1")); } Now, we need the number of tasks for the component so that we can specify the taskId parameter while emitting the tuple. This can be done using the backtype.storm.task.TopologyContext.getComponentTasks method in the prepare method of the bolt. The following snippet stores the number of tasks in a bolt field: public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) { this.numOfTasks = context.getComponentTasks("my-stream"); this.collector = collector; } Once you have a direct stream to emit to, use the backtype.storm.task.OutputCollector.emitDirect method instead of the emit method to emit it. The emitDirect method takes a taskId parameter to specify the task. In the following snippet, we are emitting to one of the tasks randomly: public void execute(Tuple input) { collector.emitDirect(new Random().nextInt(this.numOfTasks), process(input)); } Local or shuffle grouping If the tuple source and target bolt tasks are running in the same worker, using this grouping will act as a shuffle grouping only between the target tasks running on the same worker, thus minimizing any network hops resulting in increased performance. 
In case there are no target bolt tasks running on the source worker process, this grouping behaves like the plain shuffle grouping described earlier.

Custom grouping

If none of the preceding groupings fits your use case, you can define your own custom grouping by implementing the backtype.storm.grouping.CustomStreamGrouping interface. The following is a sample custom grouping that partitions a stream on the basis of the category in the tuples. Note that chooseTasks must return actual task IDs, so the category index is mapped through the targetTasks list received in the prepare method:

public class CategoryGrouping implements CustomStreamGrouping, Serializable {

    // Mapping of category to integer values for grouping
    private static final Map<String, Integer> categories = ImmutableMap.of(
        "Financial", 0,
        "Medical", 1,
        "FMCG", 2,
        "Electronics", 3
    );

    // task IDs of the subscribing bolt, initialized in the prepare method
    private List<Integer> targetTasks;

    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
    }

    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // look up the category and map it to one of the available task IDs
        String category = (String) values.get(0);
        int index = categories.get(category) % targetTasks.size();
        return ImmutableList.of(targetTasks.get(index));
    }
}

Now, we can use this grouping in our topologies with the following code snippet:

builder.setSpout("a", new SpoutA());
builder.setBolt("b", (IRichBolt) new BoltB())
    .customGrouping("a", new CategoryGrouping());

The following diagram represents the Storm groupings graphically:

Summary

In this article, we discussed stream grouping in Storm and its types.

Resources for Article:

Further resources on this subject: Integrating Storm and Hadoop [article] Deploying Storm on Hadoop for Advertising Analysis [article] Photo Stream with iCloud [article]
Packt
26 Oct 2015
5 min read

Making 3D Visualizations

Python has become the preferred language of data scientists for data analysis, visualization, and machine learning. It features numerical and mathematical toolkits such as NumPy, SciPy, scikit-learn, matplotlib, and pandas, as well as an R-like environment with IPython, all used for data analysis, visualization, and machine learning. In this article by Dimitry Foures and Giuseppe Vettigli, authors of the book Python Data Visualization Cookbook, Second Edition, we will see how visualization in 3D is sometimes effective and sometimes inevitable. In this article, you will learn how 3D bars are created. (For more resources related to this topic, see here.)

Creating 3D bars

Although matplotlib is mainly focused on 2D plotting, there are different extensions that enable us to plot over geographical maps, integrate more with Excel, and plot in 3D. These extensions are called toolkits in the matplotlib world. A toolkit is a collection of specific functions that focuses on one topic, such as plotting in 3D. Popular toolkits are Basemap, GTK Tools, Excel Tools, Natgrid, AxesGrid, and mplot3d. We will explore mplot3d further in this recipe. The mpl_toolkits.mplot3d toolkit provides some basic 3D plotting. Supported plots are scatter, surf, line, and mesh. Although this is not the best 3D plotting library, it comes with matplotlib, and we are already familiar with its interface.

Getting ready

Basically, we still need to create a figure and add the desired axes to it. The difference is that we specify a 3D projection for the figure, and the axes we add are of type Axes3D. Now, we can use almost the same functions for plotting. Of course, the difference lies in the arguments passed, for we now have three axes to provide data for. For example, the mpl_toolkits.mplot3d.Axes3D.plot function specifies the xs, ys, zs, and zdir arguments. All others are transferred directly to matplotlib.axes.Axes.plot. We will explain these specific arguments:

xs, ys: These are the coordinates for the X and Y axes
zs: These are the value(s) for the Z axis. There can be one for all points, or one for each point
zdir: This chooses which dimension will be the z-axis (usually this is zs, but it can be xs or ys)

The mpl_toolkits.mplot3d.art3d module contains 3D artist code and functions to convert 2D artists into 3D versions that can be added to an Axes3D; its rotate_axes function reorders coordinates so that the axes are rotated with zdir along the original z axis. The default value is z. Prepending the axis with a '-' does the inverse transform, so zdir can be x, -x, y, -y, z, or -z.

How to do it...

This is the code to demonstrate the plotting concept explained in the preceding section:

import random
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

mpl.rcParams['font.size'] = 10

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for z in [2011, 2012, 2013, 2014]:
    xs = range(1, 13)
    ys = 1000 * np.random.rand(12)
    color = plt.cm.Set2(random.choice(range(plt.cm.Set2.N)))
    ax.bar(xs, ys, zs=z, zdir='y', color=color, alpha=0.8)

ax.xaxis.set_major_locator(mpl.ticker.FixedLocator(xs))
ax.yaxis.set_major_locator(mpl.ticker.FixedLocator(ys))
ax.set_xlabel('Month')
ax.set_ylabel('Year')
ax.set_zlabel('Sales Net [usd]')

plt.show()

This code produces the following figure:

How it works...

We had to do the same prep work as in the 2D world. The difference here is that we needed to specify the 3D projection when creating the axes.
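Before walking through the rest of the recipe, here is a minimal, self-contained sketch of only that setup step (the sample line data is arbitrary and not part of the recipe):

# The only change from a 2D plot is requesting the '3d' projection,
# which makes add_subplot return an Axes3D instance.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot([0, 1, 2], [0, 1, 0], zs=0, zdir='z')  # a simple line drawn in the z=0 plane
plt.show()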
Then, we generate random data for supposed 4 years of sale (2011–2014). We needed to specify Z values to be the same for the 3D axis. The color we picked randomly from the color map set, and then we associated each Z order collection of xs, ys pairs we would render the bar series. There's more... Other plotting from 2D matplotlib are available here. For example, scatter() has a similar interface to plot(), but with added size of the point marker. We are also familiar with contour, contourf, and bar. New types that are available only in 3D are wireframe, surface, and tri-surface plots. For example, this code example, plots tri-surface plot of popular Pringle functions or, more mathematically, hyperbolic paraboloid:  from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm import matplotlib.pyplot as plt import numpy as np n_angles = 36 n_radii = 8 # An array of radii # Does not include radius r=0, this is to eliminate duplicate points radii = np.linspace(0.125, 1.0, n_radii) # An array of angles angles = np.linspace(0, 2*np.pi, n_angles, endpoint=False) # Repeat all angles for each radius angles = np.repeat(angles[...,np.newaxis], n_radii, axis=1) # Convert polar (radii, angles) coords to cartesian (x, y) coords # (0, 0) is added here. There are no duplicate points in the (x, y) plane x = np.append(0, (radii*np.cos(angles)).flatten()) y = np.append(0, (radii*np.sin(angles)).flatten()) # Pringle surface z = np.sin(-x*y) fig = plt.figure() ax = fig.gca(projection='3d') ax.plot_trisurf(x, y, z, cmap=cm.jet, linewidth=0.2) plt.show()  The code will give the following output:    Summary Python Data Visualization Cookbook, Second Edition, is for developers that already know about Python programming in general. If you have heard about data visualization but you don't know where to start, then the book will guide you from the start and help you understand data, data formats, data visualization, and how to use Python to visualize data. Many more visualization techniques have been illustrated in a step-by-step recipe-based approach to data visualization in the book. The topics are explained sequentially as cookbook recipes consisting of a code snippet and the resulting visualization. Resources for Article: Further resources on this subject: Basics of Jupyter Notebook and Python [article] Asynchronous Programming with Python [article] Introduction to Data Analysis and Libraries [article]

Packt
04 Jul 2016
16 min read

Data Science with R

In this article by Matthias Templ, author of the book Simulation for Data Science with R, we will cover:

What is meant by data science
A short overview of what R is
The essential tools for a data scientist in R

(For more resources related to this topic, see here.)

Data science

Looking at the job market, there is no doubt that the industry needs experts in data science. But what is data science, and how does it differ from statistics or computational statistics? Statistics is computing with data. In computational statistics, methods and the corresponding software are developed in a highly data-dependent manner using modern computational tools. Computational statistics has a huge intersection with data science. Data science is the applied part of computational statistics plus data management, including the storage of data, databases, and data security issues. The term data science is used when your work is driven by data, with a weaker component on method and algorithm development than in computational statistics, but with a lot of pure computer science topics related to storing, retrieving, and handling data sets. It is the marriage of computer science and computational statistics. As an example to show the differences, take the broad area of visualization. A data scientist is also interested in pure process-related visualizations (airflows in an engine, for example), whereas computational statistics only touches on methods for visualizing data and statistical results. Data science is the management of the entire modelling process, from data collection to automated reporting and presentation of the results. Storing and managing data, data pre-processing (editing, imputation), data analysis, and modelling are included in this process. Data scientists use statistics and data-oriented computer science tools to solve the problems they face.

R

R has become an essential tool for statistics and data science (Godfrey 2013). As soon as data scientists have to analyze data, R might be the first choice. The open-source programming language and software environment R is currently one of the most widely used and popular software tools for statistics and data analysis. It is available at the Comprehensive R Archive Network (CRAN) as free software under the terms of the Free Software Foundation's GNU General Public License (GPL), in source code and binary form. The R Core Team defines R as an environment. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Base R includes:

A suite of operators for calculations on arrays, mostly written in C and integrated in R
A comprehensive, coherent, and integrated collection of methods for data analysis
Graphical facilities for data analysis and display, either on-screen or in hard copy
A well-developed, simple, and effective programming language that includes conditional statements, loops, user-defined recursive functions, and input and output facilities
A flexible object-oriented system facilitating code reuse
High-performance computing with interfaces to compiled code and facilities for parallel and grid computing
The ability to be extended with (add-on) packages
An environment that allows communication with many other software tools

Each R package provides structured standard documentation including code application examples. Further documents (so-called vignettes) potentially show more applications of the packages and illustrate dependencies between the implemented functions and methods.
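As a quick illustration of how to reach this documentation from the R console (the packages named here, stats and grid, are just examples that ship with base R, not packages discussed in the text):

# Explore the documentation that ships with an installed package
library(help = "stats")    # short overview of the package's functions and data sets
help(package = "stats")    # full help index for the package
?median                    # help page for a single function
vignette()                 # list the vignettes of all installed packages
browseVignettes("grid")    # open one package's vignette index in a browser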
R is not only used extensively in the academic world, but also companies in the area of social media (Google, Facebook, Twitter, and Mozilla Corporation), the banking world (Bank of America, ANZ Bank, Simple), food and pharmaceutical areas (FDA, Merck, and Pfizer), finance (Lloyd, London, and Thomas Cook), technology companies (Microsoft), car construction and logistic companies (Ford, John Deere, and Uber), newspapers (The New York Times and New Scientist), and companies in many other areas; they use R in a professional context(see also, Gentlemen 2009andTippmann 2015). International and national organizations nowadays widely use R in their statistical offices(Todorov and Templ 2012 and Templ and Todorov 2016). R can be extended with add-on packages, and some of those extensions are especially useful for data scientists as discussed in the following section. Tools for data scientists in R Data scientists typically like: The flexibility in reading and writing data including the connection to data bases To have easy-to-use, flexible, and powerful data manipulation features available To work with modern statistical methodology To use high-performance computing tools including interfaces to foreign languages and parallel computing Versatile presentation capabilities for generating tables and graphics, which can readily be used in text processing systems, such as LaTeX or Microsoft Word To create dynamical reports To build web-based applications An economical solution The following presented tools are related to these topics and helps data scientists in their daily work. Use a smart environment for R Would you prefer to have one environment that includes types of modern tools for scientific computing, programming and management of data and files, versioning, output generation that also supports a project philosophy, code completion, highlighting, markup languages and interfaces to other software, and automated connections to servers? Currently two software products supports this concept. The first one is Eclipse with the extensionSTATET or the modified Eclipse IDE from Open Analytics called Architect. The second is a very popular IDE for R called RStudio, which also includes the named features and additionally includes an integration of the packages shiny(RStudio, Inc. 2014)for web-based development and integration of R and rmarkdown(Allaire et al. 2015). It provides a modern scientific computing environment, well designed and easy to use, and most importantly, distributed under GPL License. Use of R as a mediator Data exchange between statistical systems, database systems, or output formats is often required. In this respect, R offers very flexible import and export interfaces either through its base installation but mostly through add-on packages, which are available from CRAN or GitHub. For example, the packages xml2(Wickham 2015a)allow to read XML files. For importing delimited files, fixed width files, and web log files, it is worth mentioning the package readr(Wickham and Francois 2015a)or data.table(Dowle et al. 2015)(functionfread), which are supposed to be faster than the available functions in base R. The packages XLConnect(Mirai Solutions GmbH 2015)can be used to read and write Microsoft Excel files including formulas, graphics, and so on. The readxlpackage(Wickham 2015b)is faster for data import but do not provide export features. 
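As a rough sketch of these import and export interfaces (the file paths and sheet names below are hypothetical placeholders, not files from the text):

# Hypothetical files -- adjust the paths to your own data
library(readr)
library(readxl)
library(XLConnect)

orders <- read_csv("data/orders.csv")                 # fast import of delimited text
budget <- read_excel("data/budget.xlsx", sheet = 1)   # fast Excel import (no export)
writeWorksheetToFile("data/budget_out.xlsx",          # XLConnect reads and writes Excel
                     data = budget, sheet = "budget")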
The foreignpackages(R Core Team 2015)and a newer promising package called haven(Wickham and Miller 2015)allow to read file formats from various commercial statistical software. The connection to all major database systems is easily established with specialized packages. Note that theROBDCpackage(Ripley and Lapsley 2015)is slow but general, while other specialized packages exists for special data bases. Efficient data manipulation as the daily job Data manipulation, in general but in any case with large data, can be best done with the dplyrpackage(Wickham and Francois 2015b)or the data.tablepackage(Dowle et al. 2015). The computational speed of both packages is much faster than the data manipulation features of base R, while data.table is slightly faster than dplyr using keys and fast binary search based methods for performance improvements. In the author's viewpoint, the syntax of dplyr is much easier to learn for beginners as the base R data manipulation features, and it is possible to write thedplyr syntax using data pipelines that is internally provided by package magrittr(Bache and Wickham 2014). Let's take an example to see the logical concept. We want to compute a new variableEngineSizeas the square ofEngineSizefrom the data set Cars93. For each group, we want to compute the minimum of the new variable. In addition, the results should be sorted in descending order: data(Cars93, package = "MASS") library("dplyr") Cars93 %>%   mutate(ES2 = EngineSize^2) %>%   group_by(Type) %>%   summarize(min.ES2 = min(ES2)) %>%   arrange(desc(min.ES2)) ## Source: local data frame [6 x 2] ## ##      Type min.ES2 ## 1   Large   10.89 ## 2     Van    5.76 ## 3 Compact    4.00 ## 4 Midsize    4.00 ## 5  Sporty    1.69 ## 6   Small    1.00 The code is somehow self-explanatory, while data manipulation in base R and data.table needs more expertise on syntax writing. In the case of large data files thatexceed available RAM, interfaces to (relational) database management systems are available, see the CRAN task view on high-performance computingthat includes also information about parallel computing. According to data manipulation, the excellent packages stringr, stringi, and lubridate for string operations and date-time handling should also be mentioned. The requirement of efficient data preprocessing A data scientist typically spends a major amount of time not only ondata management issues but also on fixing data quality problems. It is out of the scope of this book to mention all the tools for each data preprocessing topic. As an example, we concentrate on one particular topic—the handling of missing values. The VIMpackage(Templ, Alfons, and Filzmoser 2011)(Kowarik and Templ 2016)can be used for visual inspection and imputation of data. It is possible to visualize missing values using suitable plot methods and to analyze missing values' structure in microdata using univariate, bivariate, multiple, and multivariate plots. The information on missing values from specified variables is highlighted in selected variables. VIM can also evaluate imputations visually. Moreover, the VIMGUIpackage(Schopfhauser et al., 2014)provides a point and click graphical user interface (GUI). One plot, a parallel coordinate plot, for missing values is shown in the following graph. It highlights the values on certain chemical elements. In red, those values are marked that contain the missing in the chemical element Bi. 
It is easy to see missing at random situations with such plots as well as to detect any structure according to the missing pattern. Note that this data is compositional thus transformed using a log-ratio transformation from the package robCompositions(Templ, Hron, and Filzmoser 2011): library("VIM") data(chorizonDL, package = "VIM") ## for missing values x <- chorizonDL[,c(15,101:110)] library("robCompositions") x <- cenLR(x)$x.clr parcoordMiss(x,     plotvars=2:11, interactive = FALSE) legend("top", col = c("skyblue", "red"), lwd = c(1,1),     legend = c("observed in Bi", "missing in Bi")) To impute missing values,not onlykk-nearest neighbor and hot-deck methods are included, but also robust statistical methods implemented in an EMalgorithm, for example, in the functionirmi. The implemented methods can deal with a mixture of continuous, semi-continuous, binary, categorical, and count variables: any(is.na(x)) ## [1] TRUE ximputed <- irmi(x) ## Time difference of 0.01330566 secs any(is.na(ximputed)) ## [1] FALSE Visualization as a must While in former times, results were presented mostly in tables and data was analyzed by their values on screen; nowadays visualization of data and results becomes very important. Data scientists often heavily use visualizations to analyze data andalso for reporting and presenting results. It's already a nogo to not make use of visualizations. R features not only it's traditional graphical system but also an implementation of the grammar of graphics book(Wilkinson 2005)in the form of the R package(Wickham 2009). Why a data scientist should make use of ggplot2? Since it is a very flexible, customizable, consistent, and systematic approach to generate graphics. It allows to define own themes (for example, cooperative designs in companies) and support the users with legends and optimal plot layout. In ggplot2, the parts of a plot are defined independently. We do not go into details and refer to(Wickham 2009)or(???), but here's a simple example to show the user-friendliness of the implementation: library("ggplot2") ggplot(Cars93, aes(x = Horsepower, y = MPG.city)) + geom_point() + facet_wrap(~Cylinders) Here, we mapped Horsepower to the x variable and MPG.city to the y variable. We used Cylinder for faceting. We usedgeom_pointto tell ggplot2 to produce scatterplots. Reporting and webapplications Every analysis and report should be reproducible, especially when a data scientist does the job. Everything from the past should be able to compute at any time thereafter. Additionally,a task for a data scientist is to organize and managetext,code,data, andgraphics. The use of dynamical reporting tools raise the quality of outcomes and reduce the work-load. In R, the knitrpackage provides functionality for creating reproducible reports. It links code and text elements. The code is executed and the results are embedded in the text. Different output formats are possible such as PDF,HTML, orWord. The structuring can be most simply done using rmarkdown(Allaire et al., 2015). markdown is a markup language with many features, including headings of different sizes, text formatting, lists, links, HTML, JavaScript,LaTeX equations, tables, and citations. The aim is to generate documents from plain text. Cooperate designs and styles can be managed through CSS stylesheets. For data scientists, it is highly recommended to use these tools in their daily work. We already mentioned the automated generation from HTML pages from plain text with rmarkdown. The shinypackage(RStudio Inc. 
2014)allows to build web-based applications. The website generated with shiny changes instantly as users modify inputs. You can stay within the R environment to build shiny user interfaces. Interactivity can be integrated using JavaScript, and built-in support for animation and sliders. Following is a very simple example that includes a slider and presents a scatterplot with highlighting of outliers given. We do not go into detail on the code that should only prove that it is just as simple to make a web application with shiny: library("shiny") library("robustbase") ## Define server code server <- function(input, output) {   output$scatterplot <- renderPlot({     x <- c(rnorm(input$obs-10), rnorm(10, 5)); y <- x + rnorm(input$obs)     df <- data.frame("x" = x, "y" = y)     df$out <- ifelse(covMcd(df)$mah > qchisq(0.975, 1), "outlier", "non-outlier")     ggplot(df, aes(x=x, y=y, colour=out)) + geom_point()   }) }   ## Define UI ui <- fluidPage(   sidebarLayout(     sidebarPanel(       sliderInput("obs", "No. of obs.", min = 10, max = 500, value = 100, step = 10)     ),     mainPanel(plotOutput("scatterplot"))   ) )   ## Shiny app object shinyApp(ui = ui, server = server) Building R packages First, RStudio and the package devtools(Wickham and Chang 2016)make life easy when building packages. RStudio has a lot of facilities for package building, and it's integrated package devtools includes features for checking, building, and documenting a package efficiently, and includes roxygen2(Wickham, Danenberg, and Eugster)for automated documentation of packages. When code of a package is updated,load_all('pathToPackage')simulates a restart of R, the new installation of the package and the loading of the newly build packages. Note that there are many other functions available for testing, documenting, and checking. Secondly, build a package whenever you wrote more than two functions and whenever you deal with more than one data set. If you use it only for yourself, you may be lazy with documenting the functions to save time. Packages allow to share code easily, to load all functions and data with one line of code, to have the documentation integrated, and to support consistency checks and additional integrated unit tests. Advice for beginners is to read the manualWriting R Extensions, and use all the features that are provided by RStudio and devtools. Summary In this article, we discussed essential tools for data scientists in R. This covers methods for data pre-processing, data manipulation, and tools for reporting, reproducible work, visualization, R packaging, and writing web-applications. A data scientist should learn to use the presented tools and deepen the knowledge in the proposed methods and software tools. Having learnt these lessons, a data scientist is well-prepared to face the challenges in data analysis, data analytics, data science, and data problems in practice. References Allaire, J.J., J. Cheng, Xie Y, J. McPherson, W. Chang, J. Allen, H. Wickham, and H. Hyndman. 2015.Rmarkdown: Dynamic Documents for R.http://CRAN.R-project.org/package=rmarkdown. Bache, S.M., and W. Wickham. 2014.magrittr: A Forward-Pipe Operator for R.https://CRAN.R-project.org/package=magrittr. Dowle, M., A. Srinivasan, T. Short, S. Lianoglou, R. Saporta, and E. Antonyan. 2015.Data.table: Extension of Data.frame.https://CRAN.R-project.org/package=data.table. Gentlemen, R. 2009. "Data Analysts Captivated by R's Power."New York Times.http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html. 
Godfrey, A.J.R. 2013. "Statistical Analysis from a Blind Person's Perspective."The R Journal5 (1): 73–80. Kowarik, A., and M. Templ. 2016. "Imputation with the R Package VIM."Journal of Statistical Software. Mirai Solutions GmbH. 2015.XLConnect: Excel Connector for R.http://CRAN.R-project.org/package=XLConnect. R Core Team. 2015.Foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase, ….http://CRAN.R-project.org/package=foreign. Ripley, B., and M. Lapsley. 2015.RODBC: ODBC Database Access.http://CRAN.R-project.org/package=RODBC. RStudio Inc. 2014.Shiny: Web Application Framework for R.http://CRAN.R-project.org/package=shiny. Schopfhauser, D., M. Templ, A. Alfons, A. Kowarik, and B. Prantner. 2014.VIMGUI: Visualization and Imputation of Missing Values.http://CRAN.R-project.org/package=VIMGUI. Templ, M., A. Alfons, and P. Filzmoser. 2011. "Exploring Incomplete Data Using Visualization Techniques."Advances in Data Analysis and Classification6 (1): 29–47. Templ, M., and V. Todorov. 2016. "The Software Environment R for Official Statistics and Survey Methodology."Austrian Journal of Statistics45 (1): 97–124. Templ, M., K. Hron, and P. Filzmoser. 2011.RobCompositions: An R-Package for Robust Statistical Analysis of Compositional Data. John Wiley; Sons. Tippmann, S. 2015. "Programming Tools: Adventures with R."Nature, 109–10. doi:10.1038/517109a. Todorov, V., and M. Templ. 2012.R in the Statistical Office: Part II. Working paper 1/2012. United Nations Industrial Development. Wickham, H. 2009.Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.http://had.co.nz/ggplot2/book. 2015a.Xml2: Parse XML.http://CRAN.R-project.org/package=xml2. 2015b.Readxl: Read Excel Files.http://CRAN.R-project.org/package=readxl. Wickham, H., and W. Chang. 2016.Devtools: Tools to Make Developing R Packages Easier.https://CRAN.R-project.org/package=devtools. Wickham, H., and R. Francois. 2015a.Readr: Read Tabular Data.http://CRAN.R-project.org/package=readr. 2015b.dplyr: A Grammar of Data Manipulation.https://CRAN.R-project.org/package=dplyr. Wickham, H., and E. Miller. 2015.Haven: Import SPSS,Stata and SAS Files.http://CRAN.R-project.org/package=haven. Wickham, H., P. Danenberg, and M. Eugster.Roxygen2: In-Source Documentation for R.https://github.com/klutometis/roxygen. Wilkinson, L. 2005.The Grammar of Graphics (Statistics and Computing). Secaucus, NJ, USA: Springer-Verlag New York, Inc. Resources for Article: Further resources on this subject: Adding Media to Our Site [article] Data Tables and DataTables Plugin in jQuery 1.3 with PHP [article] JavaScript Execution with Selenium [article]
Read more
  • 0
  • 0
  • 3621
Modal Close icon
Modal Close icon