The Kerberos authentication protocol is needed to secure a Hadoop cluster: it is the only authentication mechanism Hadoop supports in secure mode.
If you look at the workflow of Kerberos authentication, you can get the feeling that it is very complex. In reality it is built on simple principles, and once you learn them the complexity of Kerberos becomes easy to understand. The following concepts are central to Kerberos:
- Long-term keys
- Short-term tickets
Kerberos is only about authentication: it confirms the identity of a user. Authorization is out of scope.
Kerberos provides authentication in distributed systems consisting of many servers. That means you log in once and can then access other components without logging in again. This is also known as single sign-on.
The main component of Kerberos is the key distribution center (KDC):
- The KDC is a database of generated keys; it is used as a store of user/password pairs.
- By default this database is a file on the file system, but you can also use LDAP as the backing store; in that case Kerberos needs read/write access to LDAP.
- To make the KDC highly available you can create slave KDCs, but only the master KDC accepts changes.
- A password is converted into a key; in Kerberos, a key is the equivalent of a password.
- Authentication keys must be distributed manually, in the form of keytabs, to the other servers controlled by the KDC.
- If a password changes, the key must be regenerated and manually redistributed to the servers in the realm.
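As an illustration of key generation and manual distribution, here is a sketch using the MIT Kerberos admin tools (the principal name and keytab path are hypothetical, and the exact commands depend on your KDC setup):

```shell
# On the KDC: create a service principal with a random key.
kadmin.local -q "addprinc -randkey hdfs/host1.test.lab@TEST.LAB"

# On the KDC: export the principal's key into a keytab file.
kadmin.local -q "ktadd -k /etc/security/keytabs/hdfs.keytab hdfs/host1.test.lab@TEST.LAB"

# Manual distribution step: copy the keytab to the server that runs the service.
scp /etc/security/keytabs/hdfs.keytab host1.test.lab:/etc/security/keytabs/
```

Note that ktadd bumps the key version number, which is why a password change forces regeneration and redistribution of the keytab.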
You can define a set of servers controlled by a specific KDC, so that they can be accessed with tickets issued by that KDC.
- In Kerberos terminology this set is called a realm. Typically it is defined using DNS, but that is not the only way.
- If you define more than one such set (realm), you can establish trust between them (in one direction or in both). In that case a ticket issued by one KDC can be used for authentication against the set of servers trusted by the other KDC.
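A realm and its KDCs are typically declared in /etc/krb5.conf on every participating host; a minimal sketch, with hypothetical realm and host names:

```
[libdefaults]
    default_realm = TEST.LAB

[realms]
    TEST.LAB = {
        kdc = kdc1.test.lab
        admin_server = kdc1.test.lab
    }

[domain_realm]
    .test.lab = TEST.LAB
```

The [domain_realm] section is what maps DNS names onto the realm, which is why realms are usually defined via DNS.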
In Kerberos notation a user (actor) is called a principal and is uniquely identified as <primary>/<instance>@<REALM>. There are different types of principals:
- Users: <username>/<role>@<REALM>
- Hosts: host/<hostname>@<REALM>
- Services: <servicename>/<hostname>@<REALM>
- Passwords for services and hosts are automatically generated by the KDC as random strings.
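The <primary>/<instance>@<REALM> structure can be pulled apart with plain shell string operations; a small sketch using a sample service principal (host and realm names are hypothetical):

```shell
# Sample service principal of the form <primary>/<instance>@<REALM>.
principal="hdfs/host1.test.lab@TEST.LAB"

primary="${principal%%/*}"    # part before the first "/"
no_realm="${principal%@*}"    # principal with "@REALM" stripped
instance="${no_realm#*/}"     # part between "/" and "@"
realm="${principal##*@}"      # part after "@"

echo "primary=$primary instance=$instance realm=$realm"
```

For a user principal without an instance, the same pattern applies with the <role> part in place of the hostname.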
Your password is visible to only one server: the KDC. After that you need either a short-term ticket or a long-term keytab. That means other servers do not need to know your password, and they never receive it. All they see is the authentication ticket, which the KDC issues if authentication succeeds.
- On the server side: user/password => keytab => long-term key, manually distributed to the servers in the Kerberos realm. A keytab can contain one or more principals.
- On the client side: user/password => ticket => short-term key used to authenticate the user session, received from the KDC on request.
- Authentication in Kerberos happens by verifying the ticket against the principal/key stored in the keytab.
- After successful authentication the ticket is created and cached in the /tmp/ folder. The klist command shows the available tickets.
- To avoid using the same ticket for different services, there is a special ticket: the ticket-granting ticket (TGT). The TGT is used to ask the KDC to generate a ticket for a specific service.
- Each server in the realm keeps its keys in a keytab (principal/key pairs) issued by the KDC, so it can verify tickets locally and does not need to contact the KDC every time to check a ticket's validity.
- Time is used to check the validity of tickets, so clocks on all servers must be synchronized.
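The TGT-then-service-ticket flow above can be observed directly from the command line; a sketch assuming a reachable KDC and hypothetical principal names:

```shell
# Obtain a TGT for the user (prompts for the password).
kinit alice@TEST.LAB

# Request a service ticket; kvno asks the KDC for it using the cached TGT.
kvno hdfs/host1.test.lab@TEST.LAB

# The cache now holds both the TGT (krbtgt/TEST.LAB@TEST.LAB)
# and the service ticket, each with its validity window.
klist
```

The validity timestamps shown by klist are why synchronized clocks matter: a server rejects a ticket whose timestamp is too far from its own clock.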
You can authenticate either with a password or with a keytab; the result of authentication is a ticket:
kinit alice@TEST.ORG.LAB
kinit -kt hdfs.keytab hdfs/host1.test.lab@TEST.LAB
klist
Example of how this works with Spark:
- Short-running processes: the user should run kinit with user/password. A ticket-renewal process is needed if this approach is used for long-running processes.
- Long-running processes: during spark-submit you need to specify spark.yarn.principal and spark.yarn.keytab. In cluster mode this keytab will be copied to HDFS.
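A sketch of the long-running case (the application class, jar, and keytab path are hypothetical; on recent Spark versions the equivalent settings are spark.kerberos.principal and spark.kerberos.keytab):

```shell
# Submit a long-running job; YARN uses the keytab to renew tickets on our behalf.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.principal=hdfs/host1.test.lab@TEST.LAB \
  --conf spark.yarn.keytab=/etc/security/keytabs/hdfs.keytab \
  --class com.example.MyJob \
  my-job.jar
```

With the keytab available in the cluster, the application can re-authenticate itself whenever its ticket expires, which removes the need for a manual kinit renewal loop.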