JLabs.Study2009.Methodology

Mathematical and computer science methodology for Study2009

This text summarizes the approaches we took while analyzing data sets for the study coordinated by Prof. Jindrák for the Ministry of Health of the Czech Republic.

What the purpose is, and what we need to study, measure and compute. Given huge data sets describing both medical and economic attributes of hospitalizations, we want to derive a flexible methodology that lets us compare specific subsets of patients. A subset can be compared against the whole set, or two or more subsets can be compared, each against all of the others.

What data we collect:

All hospitalizations during two years; the time interval of interest is July 2006 through June 2008. The corresponding admissions and discharges can span a much longer period, so we in fact studied all hospitalizations that either began or ended within our interval. Thus long, and therefore interesting or important, hospitalizations are not excluded.
For every hospitalization we recorded the following data, either taken from databases or computed using elementary arithmetic:
Id of the case, primary key. Some unique identifier.
Id of the patient. A unique, preferably anonymous identifier. We are not interested in particular cases but in large sets of them.
Date of beginning (admission).
Date of end (discharge).
Length of hospitalization in whole days, rounded up, i.e. 1 + the difference of the two items above using date arithmetic.
Total price of care, excluding implants and other special items.
Total price of care in ICU or ICU-like departments. A subset of the above.
Total price of implants and other special items.
Primary diagnosis.
Department of admission.
How the hospitalization ended, with special emphasis on mortality.
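
The record and the two rules above (length of stay as 1 + date difference, and inclusion of every hospitalization that began or ended within the interval) can be sketched as follows. This is only an illustration: the field names, the exact interval boundaries as calendar dates and the sample values are our assumptions, not the schema of the original system.

```python
from dataclasses import dataclass
from datetime import date

# Interval of interest from the text: July 2006 through June 2008.
# Representing it by these two calendar dates is an assumption.
STUDY_START = date(2006, 7, 1)
STUDY_END = date(2008, 6, 30)


@dataclass(frozen=True)
class Hospitalization:
    # All field names are hypothetical; they mirror the list above.
    case_id: int
    patient_id: int
    admitted: date
    discharged: date
    price_care: float       # total care, excluding implants
    price_icu: float        # ICU-like part of price_care
    price_implants: float
    primary_diagnosis: str
    department: str
    outcome: str

    @property
    def length_days(self) -> int:
        # Whole days, rounded up: 1 + difference of the two dates.
        return 1 + (self.discharged - self.admitted).days


def in_study(h: Hospitalization) -> bool:
    # Keep a case if it either began or ended within the interval.
    began = STUDY_START <= h.admitted <= STUDY_END
    ended = STUDY_START <= h.discharged <= STUDY_END
    return began or ended


# Invented sample rows: one case straddling the start of the interval,
# one case that ended before the interval began.
example = Hospitalization(1, 101, date(2006, 6, 20), date(2006, 7, 3),
                          1000.0, 250.0, 0.0, "DX-1", "DEPT-A", "discharged")
earlier = Hospitalization(2, 102, date(2006, 5, 1), date(2006, 6, 15),
                          500.0, 0.0, 0.0, "DX-2", "DEPT-B", "discharged")

print(example.length_days, in_study(example), in_study(earlier))
```

Note that, read literally, the rule does not cover a stay that began before the interval and ended after it; the sketch follows the text as written.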

We started with a rather large matrix of data, with about a dozen columns and about 60,000 rows. This is already big enough to challenge ordinary tools such as office software packages. As we explain below, no reasonable analysis of such data is possible with a naive approach; some trickery is needed. So forget, now and forever, about analyzing such data in MS Excel or a similar tool: nothing beyond the simplest results can be obtained that way.

Over the years other rather large sets of patients have accumulated; some of them can be defined permanently, others are only virtual. Strictly speaking, practically every subset of tens of thousands of patients is permanent. But it is impractical, or even impossible, to construct all such subsets statically: there are far too many of them, many more than there are atoms in the Universe. So we must construct the subsets of interest dynamically, as the end-user requires. It follows that no reasonable data can be fixed and precomputed. On the contrary, every analytical step must be done just in time on the indicated subset, without relying on precomputed values.
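
The claim about the number of subsets is easy to check with elementary arithmetic. The sketch below counts the subsets of a set of 60,000 rows, which is 2 to the power 60,000, and compares the result with 10 to the power 80, a commonly quoted rough estimate of the number of atoms in the observable universe (that estimate is our assumption, it does not come from the study).

```python
import math

ROWS = 60_000
ATOMS_EXPONENT = 80  # assumed rough estimate: about 10**80 atoms

# log10 of the number of subsets, 2**ROWS
subsets_exponent = ROWS * math.log10(2)

# Number of decimal digits of 2**ROWS
digits = math.floor(subsets_exponent) + 1

# Smallest number of rows whose subsets already outnumber the atoms
rows_needed = math.ceil(ATOMS_EXPONENT / math.log10(2))

print(digits, rows_needed)
```

Under that estimate, a set of fewer than three hundred rows already has more subsets than there are atoms, so enumerating subsets in advance is out of the question.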

The typical subsets of patients we know of in advance:

All patients for whom a specimen was analyzed in the microbiology lab - a large superset of all potentially complicated patients.
Patients suspected of an acquired infection.
Verified cases of acquired infection, community or hospital.
Patients who tested positive for MRSA (or another strain of interest).
Prescription-based subsets - e.g. all patients to whom a specified drug was given (vancomycin, ...).
More tricky prescription-based subsets - e.g. all patients whose prescription was radically redefined.
Subsets based on invasive interventions - catheterizations, ...
Diagnosis-based subsets (diabetes mellitus as a secondary diagnosis).
Operated patients, grouped by type of operation.
Patients who died.
Any union, intersection or difference of any subsets already given.
...
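
Since every subset is ultimately an enumeration of ids, the last item of the list maps directly onto set operations. A minimal sketch using Python's built-in `set`; the ids and the subset names are invented for illustration only.

```python
# Invented case ids; in practice each set would be delivered by
# one of the source systems as a plain enumeration of ids.
all_cases = set(range(1, 11))
microbiology = {1, 2, 3, 4, 5, 6}   # specimen analyzed in the lab
suspected = {2, 3, 4, 5}            # suspected acquired infection
verified = {3, 4}                   # verified acquired infection
vancomycin = {4, 5, 9}              # a specified drug was given
died = {4, 10}

# Intersection: verified cases that also received the drug.
verified_on_drug = verified & vancomycin

# Difference: suspected cases that were never verified.
suspected_only = suspected - verified

# Union: patients who died or had a verified infection.
died_or_verified = died | verified

# Complement within the whole set: no specimen was ever analyzed.
no_specimen = all_cases - microbiology

print(verified_on_drug, suspected_only, died_or_verified, no_specimen)
```

Each result is again a set of ids, so it can be fed into any later step exactly like the subsets known in advance.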

Where the other interesting data came from. The particular subsets of patients come from specific software we maintain: an expert system for antibiotic therapy, a laboratory system for microbiology, and others. These data sources were chosen simply because they had already identified nearly all the basic subsets of interest. The subsets could just as well come from elsewhere, as plain enumerated sets of patient ids.

What kind of variables we study. Every subset (the whole set is only a special case, simply the complete subset) can be characterized by some numbers. Some of them are simply counts, others are mean values. The interpretation is sometimes time-based (days) and sometimes money-based (mean price); sometimes absolute, sometimes relative. Sometimes it makes sense to compare sizes at two levels: a sub-subset against a subset, and both against the superset. This corresponds to the relation "special cases" : "suspects" : "all patients", or to an unconditional probability together with a conditional probability under special assumptions. Every specific value in turn defines another subset of the set it was computed on. Let us go into more detail:

For every subset we want to know how many hospitalizations it represents, i.e. its size.
For every value i = 1, 2, 3, ... we want to know how many of the hospitalizations had a length in days equal to i.
There is always a well-defined mean price vector (total, ICU-like and implant-like).
The above data behave as random variables with some distributions. Naturally, we want to compare such distributions across different subsets.
Any random variable automatically defines sub-subsets, e.g. all patients of subset A (e.g. catheterized) staying <= X days. Probability levels.
...
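
The characteristics listed above can be computed in a single pass over a subset. The following is a minimal sketch under our own assumptions: hospitalizations are stored in a dictionary keyed by case id, each row holds the length in days and the three prices, and all values are invented.

```python
from collections import Counter

# case_id -> (length_days, price_care, price_icu, price_implants)
# Hypothetical layout and invented values.
CASES = {
    1: (3, 100.0, 0.0, 0.0),
    2: (3, 200.0, 50.0, 0.0),
    3: (10, 900.0, 400.0, 300.0),
    4: (21, 2000.0, 1500.0, 0.0),
}


def summarize(subset_ids):
    """Return (size, counts of lengths, mean price vector) of a subset."""
    rows = [CASES[i] for i in subset_ids]
    size = len(rows)
    # How many hospitalizations had length equal to i, for every i.
    length_counts = Counter(r[0] for r in rows)
    if size == 0:
        mean_prices = None  # the mean of an empty subset is undefined
    else:
        mean_prices = tuple(sum(r[k] for r in rows) / size
                            for k in (1, 2, 3))
    return size, length_counts, mean_prices


def staying_at_most(subset_ids, max_days):
    """Sub-subset defined by a random variable: length <= max_days."""
    return {i for i in subset_ids if CASES[i][0] <= max_days}


size, length_counts, mean_prices = summarize({1, 2, 3})
short_stays = staying_at_most({1, 2, 3, 4}, 10)
print(size, dict(length_counts), mean_prices, short_stays)
```

The result of `staying_at_most` is again a set of ids, so it can itself be summarized or combined with other subsets.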

Summary - our goals. There are many random or random-like variables one can define on our sets and their subsets. Most of our analysis can be performed using statistical methods, in the language of probability theory and conditional (Bayesian) probabilities. So we are inclined to use mathematical tools and to restate our approach as questions:

What is the distribution of the length of hospitalization under the condition that we choose hospitalizations from subset A, B or C?
How do those distributions compare with one another?
How do mean prices behave as we pass from one subset to another?
...
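
The first two questions can be made concrete with empirical distributions. The sketch below estimates P(length = i | subset) from the observed lengths and compares two subsets by the largest gap between their cumulative distributions. That gap is the statistic used by the two-sample Kolmogorov-Smirnov test; we use it here only as one possible illustration, the text does not say which comparison the study applied. The lengths are invented.

```python
from fractions import Fraction


def length_distribution(lengths):
    """Empirical P(length = i | subset), as exact fractions."""
    n = len(lengths)
    dist = {}
    for d in lengths:
        dist[d] = dist.get(d, 0) + Fraction(1, n)
    return dist


def prob_at_most(dist, x):
    """Empirical P(length <= x | subset)."""
    return sum(p for d, p in dist.items() if d <= x)


def max_cdf_gap(dist_a, dist_b):
    """Largest absolute difference between the two cumulative distributions."""
    days = sorted(set(dist_a) | set(dist_b))
    return max(abs(prob_at_most(dist_a, x) - prob_at_most(dist_b, x))
               for x in days)


# Invented lengths of stay (in days) for two subsets A and B.
dist_a = length_distribution([2, 3, 3, 5])
dist_b = length_distribution([5, 8, 8, 13])

p_short_a = prob_at_most(dist_a, 3)   # P(length <= 3 | A)
gap = max_cdf_gap(dist_a, dist_b)
print(p_short_a, gap)
```

A gap close to 0 means the two subsets have nearly the same distribution of length of stay; a gap close to 1 means they barely overlap.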

The essence of our analysis. All the above mathematical language is far too complicated for people working in hospitals. The only way to proceed effectively with the data proved to be the following:

A simple hypothesis is formed, e.g. verified cases of hospital-acquired infection take longer to cure, or those cases are more expensive.
The hypothesis is verified.
A quantification is obtained, e.g. the mean length of verified cases is twice the expected one, or the expenses are ten times higher.
Such hypotheses are reformulated, special cases are added or removed, and collateral parameters are taken into account (price of implants).
The meaning of the data is better understood, and better hypotheses can be formulated and then verified or rejected.
This iterative process repeats until good final hypotheses or questions can even be asked.
As the process continues, unexpected correlations are found, either systematically or simply by chance. Some irregularities are discovered automatically by software, others through the interaction of the end-user with the corresponding software tools.
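
The quantification step in the loop above usually comes down to one number per hypothesis. A minimal sketch, assuming a table of (length in days, total price) per case and a subset of verified cases; all ids and values are invented and the function name is hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


# case_id -> (length_days, total_price); invented values
ALL = {
    1: (4, 100.0),
    2: (6, 200.0),
    3: (5, 150.0),
    4: (20, 2000.0),
    5: (15, 1550.0),
}
VERIFIED = {4, 5}  # hypothetical subset of verified infections


def ratio_to_whole(subset_ids, column):
    """Mean of one column over the subset, divided by the mean over all cases."""
    subset_mean = mean([ALL[i][column] for i in subset_ids])
    overall_mean = mean([row[column] for row in ALL.values()])
    return subset_mean / overall_mean


length_ratio = ratio_to_whole(VERIFIED, 0)  # "how many times longer?"
price_ratio = ratio_to_whole(VERIFIED, 1)   # "how many times more expensive?"
print(length_ratio, price_ratio)
```

The two ratios translate directly into statements of the kind used in the list: the subset stays so many times longer and costs so many times more than the average case.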

What software tools we used. The major problem is that commodity software is not well suited to working with large data sets and, at the same time, with a huge number of them. We took special software already used for analysis and laboratory work and adapted it to the new purposes. In fact, we embedded the new software into the old one, the expert system for antibiotic therapy and infection surveillance. The motivation was that the same person who works with the patients daily can analyze the new data using old, already well understood means, tools and approaches. Thus no training is needed; one can start exploring the data immediately.
The main point is that the end-user can improvise with the data and freely choose the sets he or she would like to analyze. The answers of the software must be simple enough to take in at a glance: simple numbers, short sequences, data that are easily translated into human language, such as "Those patients are much more expensive" or "Despite very high expenses, the outcome for those patients is bad anyway". Surprisingly enough, except for very trivial hypotheses it is extremely unlikely that even an expert could give good offhand estimates for the variables one naturally wants to study. Everybody knows in advance that staying longer in hospital means higher expenses. On the contrary, without a very precise analysis nobody can guess with sufficient accuracy how suspected cases of hospital-acquired infection combined with secondary diagnosis X translate into higher expenses. Twice? Ten times? We shall see below.

What complicates the analysis. Any step in the analysis must take no longer than a few seconds. There is a psychological barrier with an obvious consequence: if waiting for an answer takes too long, nobody will care and nobody will ask questions. Dynamic comparison of two sets, each with tens of thousands of rows and a dozen columns, is not as trivial as one might expect. Good data structures had to be defined to store the results. Intermediate results must be stored and hidden so that they can help wherever set arithmetic can reduce the complexity. Various mean values and distributions are in principle not additive: there is no obvious way, and sometimes no way at all, to combine known precomputed results into new ones. Practically every new hypothesis means that an exhaustive search must be done, and even then the reaction time must be on the order of seconds.

What complicates the work even more. The approach of a mathematician or a computer scientist is very different from that of a doctor. Important information stays hidden until the necessary software tool is shaped and the data are loaded. Then everything must be redone because some parameter that is obvious to a doctor is missing. Any reference to advanced mathematics must be avoided. Even the notion of a random variable is a very complicated concept. Suggesting that there are distributions other than the normal or geometric ones is out of the question; that must be done in a hidden way. Surprisingly, the more trivial but consistent the approach, the better. The doctors understand their data and the nature of these data quite well. They only need a well-shaped tool to handle them using common sense ... and they derive incredibly interesting conjectures.

Comments: info@jlabs.cz