Novell's Vision of Distributed Services

DREW MAJOR
Chief Scientist
Novell, Inc.

01 Nov 1995

This DevNote contains excerpts from a keynote address Drew Major delivered at Novell's tenth annual BrainShare conference, given in March of 1995. Drew discusses distributed services, CORBA, and Novell's Advanced File System, identifying several issues of interest to developers.

Distributed Services
CORBA
Benefits of Distributed Services
Advanced File System

Distributed Services

I'd like to discuss where we believe distributed services are going to go and what is going to happen to our platforms. First of all, I'd like to talk about where we believe the industry is going to go. I coined the term "distributed services" to talk about a particular concept, because the terms I had heard of before were inadequate - particularly "client-server." Client-server, to me, means connecting a server to a bunch of clients. That's very important, but that's only one dimension.

There's another equally important dimension that we're barely starting on, one that in the long term is going to be as critical as, or more critical than, just dividing the apps between the client and server. And that is how the servers themselves, server to server, can communicate with each other, how they work together to present a common service.

Client-server was tough to program. Server-to-server is going to be tough as well. It's a whole new dimension, a whole new set of problems. But yet it's very critical to continue moving our industry forward and improving the capabilities that we can deliver. We need to make those servers work together more. They've got to look like a common team. They have to work together as a single service.

In conjunction with this, there needs to be a change from a physical, server-centric view of everything to a logical view of the services. You need to log into the network and connect to the service, and not to the server. Let me explain some of the capabilities that I see, that Novell sees, that we need to have in this new environment of distributed services. The services themselves have to be written differently. They have to be aware that there are other sibling services, other servers that they have to synchronize with and work closely with, and that requires a new dimension.

It's one thing to write something in a single box, where you kind of have direct control. But now you're dealing with external things, and not necessarily with a client but with another, equivalent server that you're sharing work with and synchronizing with. That's hard. There aren't too many things written that way yet. But that's entirely the direction that the industry needs to go.

One of the capabilities needs to be that if the servers are working together as a team and one of them goes down, the other servers are able to clean up after it. Perhaps with the assistance of a client, they need to clean up after each other.

You get other capabilities here. You get the capability of being able to add servers. If you add servers to the service, the external view stays the same, but the capacity increases. You also have to have the capability of dynamically moving resources and load back and forth between the servers.
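
As a rough illustration of the team behavior described above, here is a minimal Python sketch. It is not Novell's design: the LogicalService class, its method names, and the idea of measuring capacity in client sessions are all assumptions made for the example. Clients connect to the service by name, adding a server raises capacity without changing the external view, and when a member fails the survivors take over its clients.

```python
class LogicalService:
    """Clients address the service; member servers stay hidden behind it."""

    def __init__(self, name):
        self.name = name
        self.capacity = {}   # server -> max client sessions (hypothetical unit)
        self.sessions = {}   # client -> server currently handling it

    def add_server(self, server, capacity):
        self.capacity[server] = capacity

    def total_capacity(self):
        return sum(self.capacity.values())

    def _load(self, server):
        return sum(1 for s in self.sessions.values() if s == server)

    def connect(self, client):
        # Pick the member server with the most free capacity.
        free = {s: cap - self._load(s) for s, cap in self.capacity.items()}
        server = max(free, key=free.get)
        if free[server] <= 0:
            raise RuntimeError("service is at capacity")
        self.sessions[client] = server
        return server

    def server_failed(self, server):
        # Survivors clean up: drop the dead member and re-home its clients.
        del self.capacity[server]
        orphans = [c for c, s in self.sessions.items() if s == server]
        for client in orphans:
            del self.sessions[client]
            self.connect(client)
        return orphans


svc = LogicalService("files")
svc.add_server("A", 2)
svc.add_server("B", 2)
for client in ["c1", "c2", "c3"]:
    svc.connect(client)
svc.add_server("C", 4)            # same external view, more capacity
moved = svc.server_failed("A")    # A's clients are re-homed on B and C
```

The sketch only tracks which server owns which client. A real service would also have to recover the state the failed server was holding, which is the hard part the talk is pointing at.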

Those are very important capabilities that can come from this world, but they don't come for free. You have to program them in. You have to think about it. You have to design your service for that. Now a number of people are thinking about this.

CORBA

One area that we've been watching and paying quite close attention to has been the distributed services or distributed object work around CORBA (Common Object Request Broker Architecture). I should differentiate what I'm talking about from what's talked about there. What I'm talking about is more at the lower level, the services themselves synchronizing.

Distributed objects are kind of a higher-level way of communicating, but still the object itself has to be aware and has to do all this stuff. Microsoft is talking a bit about it. But their distributed objects are really more along the client-server line. They're doing some stuff with DCE and some stuff with DEC, and they've actually just lately become involved with CORBA, but they're focusing more just on the objects from a client to a server. And no one I see is really focusing at the infrastructure level, as we are, on the server-to-server demands. We're kind of unique in that focus. But that's the way it will be implemented and that's where the value will be.

To make this happen, we have to have a very efficient server-to-server communications channel. We need to do something similar to what we do with NCP - make sure that going to the network is faster than going to the local hard disk. We have to have that same type of highly efficient, high-performance channel. Otherwise, when the servers are synchronizing with each other, they will kill themselves in the overhead of talking to each other.

That's one of the problems with RPC and the technology and applications of the past. We will come up with, and already have well into development, a very fast way of communicating that way. We'll use different ways of marshalling and enabling that communication, so in some sense it will look like either an IDL or RPC type of communication, but underneath the plumbing is going to be really fast. This is what I was talking about last year when I talked about DPP, or Distributed Parallel Processing, the capability of having these multiple servers synchronized in a very efficient way.

Distributed services also need a distributed naming system and connection management. The distributed naming system would be based on NDS. You need connection management to manage the whole connection and provide housekeeping and infrastructure support for the distributed service itself. You also need transactional and synchronization support, such as a distributed lock manager.

Oracle Parallel Server is one example of a distributed service that works today, and it uses a distributed lock manager for synchronization. The Tuxedo functionality for distributed transactions is also extremely important.

What we're doing is building a set of services that are going to have this characteristic. At the same time we're going to build the infrastructure underneath to support it, and that infrastructure will be made available to you, so that as you decide you want to make your services distributed you can take advantage of it. In the meantime we will move ahead because many of our services need this functionality.
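
To give a feel for the bookkeeping a lock manager does for cooperating servers, here is a toy, single-process model in Python. It is only a sketch: it does not describe how Oracle Parallel Server or any Novell product implements locking, and the LockManager class and its method names are invented for illustration. Each resource has one exclusive owner and a first-come queue of waiters, and when a server dies its locks are handed on.

```python
from collections import deque


class LockManager:
    """Toy model: one exclusive owner per resource, FIFO waiters."""

    def __init__(self):
        self.owner = {}     # resource -> node holding the lock
        self.waiters = {}   # resource -> deque of nodes waiting for it

    def acquire(self, node, resource):
        """Return True if `node` holds the lock, False if it was queued."""
        if resource not in self.owner:
            self.owner[resource] = node
            return True
        if self.owner[resource] == node:
            return True
        queue = self.waiters.setdefault(resource, deque())
        if node not in queue:
            queue.append(node)
        return False

    def release(self, node, resource):
        if self.owner.get(resource) != node:
            raise ValueError(f"{node} does not hold {resource}")
        queue = self.waiters.get(resource)
        if queue:
            self.owner[resource] = queue.popleft()   # hand over to next waiter
        else:
            del self.owner[resource]

    def node_failed(self, node):
        # Remove the dead node from wait queues, then free what it held.
        for queue in self.waiters.values():
            while node in queue:
                queue.remove(node)
        for resource in [r for r, n in self.owner.items() if n == node]:
            self.release(node, resource)


dlm = LockManager()
first = dlm.acquire("server1", "VOL1:DATA.DB")
second = dlm.acquire("server2", "VOL1:DATA.DB")
dlm.node_failed("server1")   # server2 inherits the lock
```

In a real distributed lock manager this table is itself spread across the servers and kept consistent over the network, which is where the fast server-to-server channel matters.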

Benefits of Distributed Services

Let me quickly go over some of the values of it. You get fault tolerance, and things are more robust. If a server goes down, maybe you wouldn't lose all the state. Sometimes you could lose some state, but at least things are a lot more robust because these guys are working together as a team. They are no longer islands. Today's servers are islands. They should work together. And with this next generation of software they can do that.

Now you'd ask, "What happens to SFT III in that case?" SFT III does synchronization at a different level. Actually it's fairly valuable in some cases. If you wrote a particular application and you had to have the application synchronizing with each other in different servers, physically synchronizing with each other at the application level, sometimes that synchronization could be such high overhead depending on the particular problem that it could make the thing run a lot slower. In that case sometimes that logic would be better placed in an SFT III environment where synchronization happens at a lot lower level, and then it would run faster actually than being distributed.

Therefore, you have two approaches to the same problem, and sometimes one is appropriate and sometimes the other. And with SFT III we have yet another way of providing that fault tolerance by synchronizing full states of the applications themselves, and it is synchronized at a lower level. So you want to be able to do either the SFT III or the distributed services approach.

Distributed services will be more scalable. You'll be able to plug more servers in, to increase your capacity. Things will be more manageable. It's not isolated servers - it's one service. You can manage it as a single service. You're more flexible. You can reconfigure services more easily and transparently. It's more cost effective. You can use cheaper hardware to achieve scalability. Things are more secure because there are more firewalls and more breaks between systems. If one machine goes down it doesn't necessarily drag the other machines down. Anyway, there's a lot of value in distributed services.

Advanced File System

What I'd like to quickly go through is an example to illustrate this even further, the example of our Advanced File System, how we're developing that and the capabilities it's going to have.

What we're really doing here, first of all, with the Advanced File System, we're breaking the naming service out from the storage service. And they'll be separate entities. We think that's very important for a number of reasons. What we're really trying to build here is very intelligent data storage, data storage that does a lot of intelligent things in terms of where data gets moved and how it is handled.

We'll be basing a distributed naming service on some of our NDS technology, but it will have to be enhanced to meet the requirements of the file system directory. Because what will happen is, with the new naming system you'll have a single, network-wide view of the names and the files and the directories. It will be a single, logical, distributed file system type of view. Then the storage system is a separate service that will do the actual controlling of placement and storage of data on the disk.

A scenario can help illustrate some of the capabilities. When you initially installed it, what the naming service would do is go out and look at the whole network and get a snapshot - a physical view of what the network is today. In today's networks, the way you find objects is you go to a server, and then a volume and directory. Well, the naming service in a sense would suck all that information up and build its own equivalent view of that as a logical view. So it becomes kind of a symbolic thing - sort of like symbolic links. It's a symbolic view that is equivalent to what is physically there. From then on, accesses will go through the naming service prior to hitting the disks.

Multiple Views of Data

So now we've kind of put a level of indirection there between the naming and the actual storage. That gives us a lot more flexibility to add new views of the data. You can now take the data and reconstruct it and come up with an alternative single directory tree (for example) of the data. At the same time, you're going to preserve the old view.

So that way you don't have to go out and change the login scripts and reinstall the applications. They're all looking and thinking that they're opening the file in a certain way - going to the physical server, and so on. But now it's going through the naming service and that gives us now - because we have that extra level of indirection - the option to be able to move the data around where it really needs to go and migrate it and replicate it, and so on.

The view is preserved even while things change underneath. Over time perhaps the physical view goes away, and people start using the logical view, but you have multiple views of the data and that's how we preserve backwards compatibility.
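
The level of indirection described in this section can be sketched in a few lines of Python. This is an illustration under assumptions, not the Advanced File System's actual design: the NamingService class, its methods, the view names, and the server names UTAH1 and CAL1 are all hypothetical. The point it shows is that names resolve through a table, so data can move while both the old physical-style name and a new logical name keep working.

```python
class NamingService:
    """Names map to file ids; file ids map to wherever the data lives now."""

    def __init__(self):
        self.location = {}  # file id -> (server, volume, path)
        self.views = {}     # view name -> {name in that view -> file id}

    def snapshot(self, physical_files):
        """Build the initial view that mirrors today's physical layout."""
        view = self.views.setdefault("physical", {})
        for file_id, (server, volume, path) in enumerate(physical_files):
            self.location[file_id] = (server, volume, path)
            view[f"{server}/{volume}:{path}"] = file_id

    def add_alias(self, view_name, logical_name, physical_name):
        """Expose an existing file under a new name in another view."""
        file_id = self.views["physical"][physical_name]
        self.views.setdefault(view_name, {})[logical_name] = file_id

    def resolve(self, view_name, name):
        return self.location[self.views[view_name][name]]

    def migrate(self, file_id, new_location):
        # Only the location table changes; every view keeps its names.
        self.location[file_id] = new_location


ns = NamingService()
ns.snapshot([("UTAH1", "SYS", "APPS/WP.EXE")])
ns.add_alias("tree", "/apps/wp.exe", "UTAH1/SYS:APPS/WP.EXE")
ns.migrate(0, ("CAL1", "VOL1", "APPS/WP.EXE"))

old_view = ns.resolve("physical", "UTAH1/SYS:APPS/WP.EXE")
new_view = ns.resolve("tree", "/apps/wp.exe")
```

After the migration both names resolve to the new location, which is why login scripts and installed applications that still use the old name would not need to change.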

Data Replication

Data will be managed in a more intelligent way. It will migrate and replicate according to its use. So if you have data in a server in Utah, and then you go to California, and connect up to the network, when you ask for some of that data it will follow you and it may end up replicated on the server in California while you're working there. When you come back up to Utah, it will follow you back. While you're in California, you ask for a WordPerfect application. Instead of going all the way over to Utah to download it, it would know that it has replicated, equivalent files in California, and it would give you that same thing. But you would still see the same view wherever you're at. That's very powerful. That's the power of logical naming. And that's the power of this advanced file system.

Fault Tolerance

You can add fault tolerance. When you create files, it can put them in two places initially, for example. So if one server goes down, that's okay, you've got another one. And over time, once the servers are backed up and one of the files is backed up, the system can automatically get rid of the duplicate copy. We'll also have the capability of using local hard disks as a local cache and a replica of the network data for performance reasons.
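
Here is a small Python sketch of the policy just described: write new files to two servers, read from whichever copy is reachable, and drop the extra copy once a backup exists. The ReplicatedStore class and its placement rule (pick the servers holding the fewest files) are assumptions for the example and say nothing about how the real storage service chooses locations.

```python
class ReplicatedStore:
    def __init__(self, servers, copies=2):
        self.servers = {name: {} for name in servers}  # server -> {file: data}
        self.copies = copies
        self.backed_up = set()

    def create(self, name, data):
        # Place the new file on the servers currently holding the fewest files.
        ranked = sorted(self.servers, key=lambda s: len(self.servers[s]))
        targets = ranked[: self.copies]
        for server in targets:
            self.servers[server][name] = data
        return targets

    def holders(self, name):
        return [s for s, files in self.servers.items() if name in files]

    def read(self, name, down=()):
        # Any reachable copy will do.
        for server in self.holders(name):
            if server not in down:
                return self.servers[server][name]
        raise LookupError(f"no reachable copy of {name}")

    def mark_backed_up(self, name):
        # With a backup in place, keep one live copy and drop the rest.
        self.backed_up.add(name)
        for server in self.holders(name)[1:]:
            del self.servers[server][name]


store = ReplicatedStore(["A", "B", "C"])
placed = store.create("REPORT.DOC", b"draft 1")
survivor = store.read("REPORT.DOC", down=("A",))   # server A is down
store.mark_backed_up("REPORT.DOC")
remaining = store.holders("REPORT.DOC")
```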

We could still preserve security by encrypting. You know today a lot of times you put sensitive information on your local machines just so that no one else can see it. If you put it on the server, then maybe the administrator or somebody could see it. We'll preserve that level of security by giving you the option of encrypting the data prior to it being moved to the server. So you'll still have the same security even if the data has been replicated up on the server from the local hard disk.

Client-Network Integration

What we really need to do is change the paradigm. Thirteen years ago when we first began networking, what we did is we made the network look like a local hard disk. That was the whole way and that's actually what we continue to do. Of course we added symbolic drives to do that. We need to over time evolve into a different view. The view is that the local hard disk is just a part of the network. Right now we have a mixed view of the network. You deal with local things one way, and then you deal with the network other ways. It needs to look like one integrated thing. And what's local should just be an integrated part and a cache or a replica of the greater system.

Mobile Client. And that's kind of alluding to the one capability, which is the capability of being disconnected. If the local machine is a replica and has a cache of some of the data, then when you're disconnected you can still preserve the same view. It's transparent. Now the NetWare mobile client is the first phase of this. It's the first capability we'll merge in when the full NetWare wide capability is put in. When you're disconnected from some of the areas, you'll see the same view of the data, even though some of the areas won't be accessible at the time. And when you reconnect, things will auto-resynchronize.

Server Shutdown or Failure. There are several capabilities that can be best illustrated by talking about some shutdown or failure scenarios. Suppose you want to maintain a server. Today you need to get everyone to get off the server, right? Well, what you can do in this environment is go to the server and say, "Depopulate yourself." Then the server will move all the data that it has the only copy of away onto other servers and at some time (it may be awhile) come back and say, "It's cool now. You can shut me down. Everything is somewhere else. All the client connections have been transparently migrated." That's one scenario that will be very valuable.
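
The "depopulate yourself" step can be sketched as a planning function. This is a hedged illustration only: the depopulate function, its arguments, and the least-loaded placement rule are assumptions, and the sketch leaves out the transparent migration of client connections that the scenario also calls for. The one rule it does capture is that a file must be moved if the departing server holds the only copy, and can simply be dropped there otherwise.

```python
def depopulate(server, servers, placement):
    """Plan the moves needed before `server` can be shut down.

    servers:   list of all server names in the service.
    placement: dict mapping file name -> set of servers holding a copy;
               it is updated in place.
    Returns a list of (file name, target server) moves.
    """
    others = [s for s in servers if s != server]
    if not others:
        raise RuntimeError("no other server to take the data")
    load = {s: sum(1 for h in placement.values() if s in h) for s in others}
    moves = []
    for name in sorted(placement):
        holders = placement[name]
        if server not in holders:
            continue
        holders.discard(server)
        if not holders:
            # This was the only copy, so it must be moved, not just dropped.
            target = min(others, key=lambda s: load[s])
            holders.add(target)
            load[target] += 1
            moves.append((name, target))
    return moves


placement = {
    "A.TXT": {"FS1"},          # only copy lives on FS1
    "B.TXT": {"FS1", "FS2"},   # already replicated elsewhere
    "C.TXT": {"FS2"},
}
moves = depopulate("FS1", ["FS1", "FS2", "FS3"], placement)
safe_to_shut_down = all("FS1" not in holders for holders in placement.values())
```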

On server failure, again what could happen is that the other servers will realize the one server is dead. So they'll say, "Okay what did we have?" They'll talk to all the clients. Or there will be an initiation type of thing where the client says, "Hey, can someone help me here? I used to be talking to this guy, and now he's dead. Can you help me? Do you have a copy of this file?" And things will reconcile. If the client is keeping all the state changes that he's been making and these file changes have been happening on two servers, then very quickly the client can continue with virtually no loss of data.

Another scenario is if your client hard disk fails. If you had that replicated up on the server, all you have to do is just plug another hard disk in and the thing will naturally move the data back down on your client and it'll be no big deal. It'll be very, very powerful. There's a lot of power in going logical, getting away from the physical view, into the logical view, and then taking advantage of the flexibility that provides.

Reducing Costs

This will reduce cost in a number of ways. First, you can get rid of a lot of the replicated data that you have now. You have these gigantic, mondo apps now that are 30 or 40 megabytes that you install on the server - of which maybe three or four megabytes is actually used. But the rest of it - who knows what to do with it? You can't get rid of it because you won't be able to. Who knows what all those DLLs are for? And usually you have to install it on each of the servers to make it all work, for performance reasons.

But what if the system now just goes in and realizes that some of these files are never accessed? So yeah they're everywhere, but the reality is there's only one copy somewhere. And then all the other ones are just migrated versions of that same copy.
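
One way to picture "they're everywhere, but there's only one copy somewhere" is a store that keeps each distinct file body once and lets many server paths point at it. The sketch below uses content hashing to spot identical files. That technique is an assumption of this example; the talk does not say how duplicates would be detected, and the SingleCopyStore class and the file and server names are hypothetical.

```python
import hashlib


class SingleCopyStore:
    def __init__(self):
        self.blobs = {}   # content hash -> bytes, stored once
        self.names = {}   # (server, path) -> content hash

    def install(self, server, path, data):
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)      # keep only the first copy
        self.names[(server, path)] = digest

    def read(self, server, path):
        return self.blobs[self.names[(server, path)]]

    def logical_bytes(self):
        """Size as users see it: every installed name counts."""
        return sum(len(self.blobs[d]) for d in self.names.values())

    def stored_bytes(self):
        """Size actually kept: each distinct body counts once."""
        return sum(len(b) for b in self.blobs.values())


store = SingleCopyStore()
dll = b"x" * 1000
for server in ["FS1", "FS2", "FS3"]:
    store.install(server, "APPS/SHARED.DLL", dll)

logical = store.logical_bytes()
stored = store.stored_bytes()
```

Every server still appears to have the file, but the body is held once, so the three installs cost the storage of one.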

We did an analysis of our servers internally and found that at least 25 percent of our data was replicated packages and applications. So we can see some value here - even cost savings - by getting rid of duplicate, redundant data. It's going to be evolutionary, as I mentioned in the scenario. Users won't see a change.

We learned from our bindery-to-NDS experience. We're not going to do that again in terms of compatibility. We thought we got it right. We did about 90% of it. It was that last 10% that gave us some problems, as some of you probably experienced. But we're not going to do that again. This is a single architecture. It consolidates backup and migration. Backup now is just an extension of migration of the data.

Improving Performance

Integrating the client workstation with the network does a lot more things to improve performance. I mentioned we'll be adding distributed cache. The data itself gets moved closer to the consumers. The system moves data as a function of use. Say it has a single copy at a particular point of the network and people are accessing it at another point. A server will see that the data's going over the backbone and that it's being accessed a lot, and the system can figure out that it should migrate that file, or replicate that file, so that all those consumers get the closer copy. It will be very intelligent. And it's actually not that hard of an algorithm to work out. Just look at access patterns, and figure out better ways of replicating and migrating the data.
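
A very simple version of "look at access patterns" might count reads that cross the backbone and suggest a replica once a file is read often enough from a site that has no copy. The Python sketch below does just that. The plan_replication function, the threshold of three reads, and the site names are assumptions for illustration, and a real policy would also weigh file size, link cost, and how often the file changes.

```python
from collections import Counter


def plan_replication(accesses, holders, threshold=3):
    """Suggest replicas for files read often from a site without a copy.

    accesses: iterable of (file, site) read events.
    holders:  dict mapping file -> set of sites that already hold a copy.
    Returns a sorted list of (file, site) pairs to replicate.
    """
    remote_reads = Counter(
        (name, site)
        for name, site in accesses
        if site not in holders.get(name, set())
    )
    return sorted(
        pair for pair, count in remote_reads.items() if count >= threshold
    )


holders = {"WP.EXE": {"utah"}, "MEMO.DOC": {"utah"}}
accesses = (
    [("WP.EXE", "california")] * 3    # read repeatedly over the backbone
    + [("WP.EXE", "utah")] * 5        # local reads, no action needed
    + [("MEMO.DOC", "california")]    # read once, below the threshold
)
plan = plan_replication(accesses, holders)
```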

Local hard disk becomes part of the network cache. That's very powerful. Many times you won't have to ping the network any more to get the data.

We're adding the capability in the storage system to do server striping. That would be for very high volume files - video files, for example - files that are really hit a lot and are being changed a lot. If a particular server doesn't have enough bandwidth, you can take parts of the file and stripe them across multiple servers, as well as striping across multiple disks, which we've been able to do for a long time.

Another performance enhancement is the mounting of volumes, probably one that will be very appreciated by everyone. Now it's going to be within half a minute. And there will be a lot less need to run VRepair. Again all this stuff is in prototype, and it's looking really good. We're going to preserve the performance. We're adding, as I mentioned, mobile and disconnected support. Management is simpler. You get the fault tolerance.
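
Striping across servers works the same way as striping across disks: cut the file into fixed-size pieces and deal them out in rotation. Here is a minimal Python sketch of that round-robin layout and of putting the file back together. The function names, the tiny four-byte stripe size, and the server names are assumptions chosen to keep the example readable.

```python
def stripe(data, servers, stripe_size=4):
    """Deal fixed-size pieces of `data` out to `servers` in rotation."""
    layout = {s: [] for s in servers}
    for offset in range(0, len(data), stripe_size):
        server = servers[(offset // stripe_size) % len(servers)]
        layout[server].append(data[offset : offset + stripe_size])
    return layout


def reassemble(layout, servers):
    """Read the pieces back in the same rotation they were written."""
    chunks = []
    index = 0
    while True:
        server = servers[index % len(servers)]
        position = index // len(servers)
        if position >= len(layout[server]):
            break   # first missing piece marks the end of the file
        chunks.append(layout[server][position])
        index += 1
    return b"".join(chunks)


servers = ["S1", "S2", "S3"]
video = b"abcdefghijklmnopqrstuv"
layout = stripe(video, servers)
restored = reassemble(layout, servers)
```

Because consecutive pieces live on different servers, a large sequential read can pull from all of them at once, which is where the extra bandwidth comes from.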

One other area I didn't talk about with this naming system is enhancing the ways of querying, searching, and looking for data. You will be able to search for data by content and attributes, and will have some document management capabilities built in.

* Originally published in Novell AppNotes
