This text summarizes the approaches we took while analyzing data sets for the study coordinated by Prof. Jindrák for the Ministry of Health of the Czech Republic.
What are the purposes; what do we need to study, measure, compute, ... Given huge data sets describing both the medical and the economic attributes of hospitalizations, we want to derive a flexible methodology that allows us to compare specific subsets of patients. The comparison can be either against the whole set, or of two (or possibly more) subsets against all of the others.
What data we collect:
We started with a rather large matrix of data with about a dozen columns and about 60,000 rows. This set is already big enough to challenge ordinary tools such as office software packages. As we will explain in the sequel, without a somewhat tricky approach no reasonable analysis of such data is possible; a naive approach fails. So forget, now and forever, about analyzing such data in MS Excel or a similar tool: nothing beyond the most trivial results can be obtained that way.
Over the years, other rather large sets of patients have accumulated; some of them can be considered permanent, some only virtual. Strictly speaking, practically every subset of tens of thousands of patients is permanent. But it is impractical, or even impossible, to construct such subsets statically: there are far too many of them, more than there are atoms in the Universe. So we must take a tricky route and construct the subsets of interest only dynamically, as the end-user requests them. It follows that no reasonable data can be fixed and pre-computed. On the contrary, every analytical step must be done just-in-time on the indicated subset, without any possibility of using precomputed values.
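The combinatorial claim above, and the dynamic (predicate-driven) construction of subsets, can be sketched as follows. This is an illustrative sketch only; the record fields (`days`, `cost`, `hai_suspect`) and the `select` helper are assumptions, not the actual system's schema or API.

```python
# Why static enumeration of subsets is hopeless: a set of 60,000 rows
# has 2**60000 subsets, a number with about 18,062 decimal digits,
# while the atoms in the observable Universe are usually estimated
# at around 10**80 (81 digits).
ROWS = 60_000
subset_count_digits = len(str(2 ** ROWS))
print(subset_count_digits)  # 18062

# Dynamic construction instead: a subset is just a filter, materialized
# only at the moment the end-user asks for it.
patients = [
    {"id": 1, "days": 12, "cost": 48_000,  "hai_suspect": True},
    {"id": 2, "days": 3,  "cost": 9_500,   "hai_suspect": False},
    {"id": 3, "days": 21, "cost": 130_000, "hai_suspect": True},
]

def select(rows, predicate):
    """Materialize a subset on demand; nothing is precomputed."""
    return [r for r in rows if predicate(r)]

suspects = select(patients, lambda r: r["hai_suspect"])
print(len(suspects))  # 2
```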
What are the typical subsets of patients we know of in advance:
Where the other interesting data came from. The particular subsets of patients come from specific software we maintain: an expert system for antibiotic therapy, a laboratory system for microbiology, and others. These data sources were chosen simply because they already identify nearly all the basic subsets of interest. The subsets could, of course, just as well come from elsewhere as purely enumerative sets of patient ids.
What kind of variables do we study. Every subset (the whole set is only a special case, namely the complete subset) can be characterized by some numbers. Some of them are simple counts, others are mean values. The interpretation is sometimes time-based (days) and sometimes money-based (mean price); sometimes absolute, sometimes relative. Sometimes it makes sense to compare sizes on two different levels: sub-subset against subset, and both against the superset. This corresponds to the relation "special cases" : "suspects" : "all patients", or to an unconditional probability together with a conditional probability under special assumptions. As a flash-back effect, every specific value in turn defines another subset of the set it was computed on. Let us go into more detail:
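The two-level comparison "special cases" : "suspects" : "all patients" can be read as an unconditional and a conditional proportion. A minimal sketch, with purely synthetic data and invented field names (`hai_suspect`, `hai_confirmed`):

```python
# Synthetic cohort: every 5th patient is a suspect, every 15th is confirmed.
patients = [
    {"id": i, "hai_suspect": i % 5 == 0, "hai_confirmed": i % 15 == 0}
    for i in range(1, 61)
]

all_n     = len(patients)
suspects  = [p for p in patients if p["hai_suspect"]]
confirmed = [p for p in suspects if p["hai_confirmed"]]

p_suspect              = len(suspects) / all_n           # P(suspect), unconditional
p_confirmed_if_suspect = len(confirmed) / len(suspects)  # P(confirmed | suspect)

print(p_suspect)               # 0.2
print(p_confirmed_if_suspect)  # 4 confirmed out of 12 suspects
```

The same code pattern applies with counts replaced by lengths of stay or prices; only the field being aggregated changes.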
Summary - our goals. There are many random or random-like variables one can define over our sets and their subsets. Most of our analysis can be performed using statistical methods, in the language of probability theory and conditional (Bayesian) probabilities. So we are inclined to use mathematical tools and to re-state our approach as questions:
The essence of our analysis. All the above mathematical language is far too complicated for people working in hospitals. The only way to proceed effectively with the data proved to be the following:
What software tools we used. The major problem is that commodity software is not well suited to working simultaneously with large data sets and with a huge number of them. We took special software already used for analysis and laboratory work and shaped it for the new purposes. In fact, we implanted the new software into the old one, the expert system for antibiotic therapy and infection surveillance. The motivation was that the same person who works with the patients daily can analyze the new data using old, already well understood means, tools and approaches. Thus no training is needed; one can start exploring the data immediately.
The main point is that the end-user can improvise with the data and freely choose the sets he or she wants to analyze. The answers of the software must be simple enough to grasp at one go: simple numbers, short sequences, data that are easily translated into human language such as "Those patients are much more expensive" or "Despite very high expenses, the outcome for those patients is bad anyway". Surprisingly, except for very trivial hypotheses it is extremely unlikely that even an expert could give good off-hand estimates for the variables one naturally wants to study. Thus everybody knows in advance that staying longer in hospital means higher expenses. On the contrary, without a very precise analysis nobody can guess sufficiently exactly how suspected cases of hospital-acquired infection combined with secondary diagnosis X translate into higher expenses. Twice as high? Ten times? We shall see in the sequel.
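The kind of one-glance answer described above (a single ratio of mean expenses between a dynamically chosen subset and the rest) can be sketched as follows. The cost figures and the `hai_suspect` flag are invented for illustration; they are not the study's actual data.

```python
# Synthetic hospitalization records: three ordinary stays, two suspected
# hospital-acquired infection (HAI) cases with much higher costs.
patients = [
    {"cost": 9_000,  "hai_suspect": False},
    {"cost": 11_000, "hai_suspect": False},
    {"cost": 10_000, "hai_suspect": False},
    {"cost": 38_000, "hai_suspect": True},
    {"cost": 42_000, "hai_suspect": True},
]

def mean_cost(rows):
    return sum(r["cost"] for r in rows) / len(rows)

chosen = [p for p in patients if p["hai_suspect"]]
others = [p for p in patients if not p["hai_suspect"]]

# One number, directly translatable into human language.
ratio = mean_cost(chosen) / mean_cost(others)
print(f"Suspected HAI cases are {ratio:.1f}x more expensive")  # 4.0x
```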
What complicates the analysis. Any step of the analysis must take no longer than a few seconds. There is a psychologically based barrier with an obvious translation: if waiting for an answer takes too long, nobody will care and nobody will ask questions. Dynamically comparing two sets, each with tens of thousands of rows and a dozen columns, is not as trivial as one might expect. Good data structures had to be designed to store the results. Intermediate results must be stored and cached so that set arithmetic can help tame the complexity wherever possible. Various mean values and distributions are in principle not summable, or there is no obvious way to combine known precomputed results into new ones, easily or at all. Practically every new hypothesis means an exhaustive search must be done, and even then the reaction time must be on the order of seconds.
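The "non-summable" nature of means, and one standard way around it, can be illustrated with a small sketch: caching the pair (sum, count) instead of the mean itself makes disjoint subsets combinable by set arithmetic. This is a generic technique sketched under our own naming assumptions, not the actual data structures of the system described above.

```python
from dataclasses import dataclass

@dataclass
class Agg:
    """Cached intermediate result: a (sum, count) pair, not a mean."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def merge(self, other):
        # Disjoint subsets combine exactly: mean(A + B) is NOT
        # (mean(A) + mean(B)) / 2, but (sum_A + sum_B) / (n_A + n_B).
        return Agg(self.total + other.total, self.count + other.count)

    @property
    def mean(self):
        return self.total / self.count

ward_a, ward_b = Agg(), Agg()
for cost in (10_000, 20_000):
    ward_a.add(cost)
ward_b.add(30_000)

combined = ward_a.merge(ward_b)
print(ward_a.mean, ward_b.mean, combined.mean)  # 15000.0 30000.0 20000.0
# Naively averaging the two means would wrongly give 22500.
```

Quantiles and full distributions do not admit such a cheap merge, which is exactly why some hypotheses still force an exhaustive pass over the raw rows.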
What complicates the work even more. The approach of a mathematician or a computer scientist is very different from that of a doctor. The important information stays hidden until the necessary software tool is shaped and the data are loaded. Then everything must be redone because some parameter, obvious to a doctor, is missing. Any reference to advanced mathematics must be avoided. Even the notion of a random variable is a very complicated concept. Suggesting that there are distributions other than normal or geometric is out of the question; such things must be done in a hidden way. Surprisingly, the more trivial but consistent the approaches applied, the better. The doctors understand their data and its nature quite well. They only need a well-shaped tool to handle it using common sense ... and they derive incredibly interesting conjectures.