2012 Conference

Annual Conference of the Israel Statistical Association

Tel Aviv University, 22.5.2012

Sponsored by Maya Computers

Conference fee: 220 NIS, including lunch.

  • Association members: 180 NIS
  • Students: 100 NIS, on presentation of a student ID

Lunch is guaranteed only for those who register in advance.

To register for the conference, click this link

 

 

Plenary location: Dan David Hall 003 (no. 17 on the map)
Second room: Holcblat Hall, in the Exact Sciences corridor (no. 24 on the map)

Conference committee: Saharon Rosset (chair), Moshe Pollak, Emil Bashkansky

All talks except Prof. Larry Brown's will be given in Hebrew.

Conference Program

8:30-9:00

Gathering, registration, and light refreshments

9:00-10:00

Opening remarks: Ron Kenett, President of the Association


Plenary talk:
Variable Selection Insurance

Prof. Larry Brown, University of Pennsylvania

(Talk abstract)

10:00-10:20

Coffee break and move to the session rooms

10:20-12:00

Parallel sessions: invited talks


Dan David 003
Session 1: Theory and Methodology
Chair: Micha Mandel, The Hebrew University

(25 minutes per talk; talk abstracts)

ORDANOVA: Analysis of Ordinal Variation and its Possible Applications
Tamar Gadrich, Ort Braude College

Detecting novel bivariate associations in large data sets
Yakir Reshef

Shewhart Revisited
Moshe Pollak, Hebrew University of Jerusalem

Optimal adaptive designs to maximize power in clinical trials with multiple treatments
David Azriel, Technion


Holcblat Hall
Session 2: Official Statistics
Chair: Luisa Burck, Israel Central Bureau of Statistics (CBS)

(25 minutes per talk; talk abstracts)

Estimating the Mean Squared Error (MSE) of Seasonally Adjusted Series
Sharif Abu Gosh, CBS

Academic Achievements of Bachelor's Degree Recipients in Israel
David Maagan, Miri Dadash-Alon, CBS

Constructing Administrative Families from the Population Register
Theodor Itskov, Alexandra Katzenelenbogen, CBS

A Time to Give Birth and a Time to Live: The Relationship Between the Mother's Age at the Birth of Her Last Child and Longevity
Orly Manor, The Hebrew University

12:00-13:30

Joint lunch in the Palm Garden
Including: the annual general assembly of the Israel Statistical Association
and the presentation of the Association's awards

13:30-15:10

Parallel sessions: invited talks


Dan David 003
Session 3: Bioinformatics and Biostatistics
Chair: Malka Gorfine, Technion
(25 minutes per talk; talk abstracts)

A consistent multivariate test of association based on ranks of distances
Ruth Heller, Tel-Aviv University

On the Robustness of the Adaptive LASSO to Model Misspecification
Yair Goldberg, Haifa University

Probabilistic Analysis of Saturation Mutagenesis
Yuval Nov, Haifa University

Accurate Estimation of Heritability in Genome Wide Studies using Random Effects Models
David Golan, Tel Aviv University

 


Holcblat Hall
Session 4: Industrial Statistics
Chair: Inbal Yahav, Bar-Ilan University
(25 minutes per talk; talk abstracts)

The Gestalt in Graphs: Prediction Using Economic Networks
Tomer Geva, Tel Aviv University

The Visible Hand of Social Networks in Electronic Market
Gal Oestreicher-Singer, Tel Aviv University

Paying for Content or Paying for the Community? The Effect of Social Involvement on Subscribing to Media Web Sites
Liron Sivan, Tel Aviv University

Can the National "Mood" Predict Rises in Unemployment?
Boaz Arad, Maya Computers

15:10-15:20

Break

15:20-16:40

Parallel sessions: contributed talks


Dan David 003
Session 5: Talks from the Department of Statistics at Tel Aviv University
Chair: Saharon Rosset, Tel Aviv University

(20 minutes per talk; talk abstracts)

Bayesian FDR controlling procedures
Daniel Yekutieli, Tel Aviv University

The “Less Fitting, More Optimism” Paradox in Model Selection
Shachar Kaufman, Tel Aviv University

Revisiting The Statistical Analysis of the Israeli-Palestinian Conflict
Jonathan Rosenblatt, Tel Aviv University

Selective confidence intervals
Yoav Benjamini, Tel Aviv University


Holcblat Hall
Session 6: Contributed Talks
Chair: to be determined
(20 minutes per talk; talk abstracts)

Sojourn Time Estimation in an M/G/∞ Queue with Partial Information
Nafna Nelgabats, Haifa University

High throughput genome-wide scan for epistasis with implementation to Recombinant Inbred Lines (RIL) populations.
Pavel Goldstein, University of Haifa

Model With Acceptance Threshold in Social Networks
Alon Sela, Tel Aviv University

Variable selection by combinatorial optimization algorithms, with application to pharmacogenomics
Joseph Levy, Teva Pharmaceutical Industries

16:40-17:00

Coffee break and move to the plenary hall

17:00-17:50

Plenary talk: (Probabilistic) Imagination and (Statistical) Reality in Financial Markets
Prof. Isaac Meilijson, Tel Aviv University

(Talk abstract)

Talk Abstracts

Plenary Talks

Variable Selection Insurance
Lawrence D. Brown, Statistics Department, Wharton School, Univ of Pennsylvania

Among statisticians variable selection is a common and very dangerous activity. This talk will survey the dangers and then propose two forms of insurance to guarantee against the damages from this activity.
Conventional statistical inference requires that a specific model of how the data were generated be specified before the data are examined and analyzed. Yet it is common in applications for a variety of variable selection procedures to be undertaken to determine a preferred model followed by statistical tests and confidence intervals computed for this “final” model. Such practices are typically misguided. The parameters being estimated depend on this final model, and post-model-selection sampling distributions may have unexpected properties that are very different from what is conventionally assumed. Confidence intervals and statistical tests do not perform as they should.
We address this dilemma within a standard linear-model framework. There is a numerical response of interest (Y) and a suite of possible explanatory variables, X1,…,Xp, to be used in a multiple linear regression. The data is gathered, a multivariate linear model is constructed using a selected subset of the potential X variables, and inference (estimates, confidence intervals, tests) is performed for the selected slope parameters.
We propose two types of insurance to guarantee against the deleterious effects of this type of variable selection. The first provides valid confidence intervals and tests based on the design matrix of the observed variables. It does not require adherence to a pre-specified variable selection algorithm. This insurance may involve overly conservative procedures; but on the other hand, no less conservative procedure of this type will provide the desired insurance. The second type of insurance is purchased through use of a properly specified split-sample bootstrap. These intervals may be less conservative, but are not always so, and part of their price lies in the split-sample scheme that effectively sacrifices a portion of the data.
This is joint work with R. Berk, A. Buja, E. George, E. Pitkin, M. Traskin, K. Zhang and L. Zhao.

(Probabilistic) Imagination and (Statistical) Reality in Financial Markets
Prof. Isaac Meilijson, Tel Aviv University

This talk will exhibit some issues in Finance in which a statistician's outlook may be of interest and relevance to Finance practitioners, besides being of intrinsic interest to statisticians. In most of these issues, you may expect questions to be vague but much more precise than answers. Here is a sample of these issues:

  • The Black&Scholes-type model assumes logarithms of prices of financial instruments to have i.i.d. multivariate Gaussian distributed increments. Some departures from this paradigm will be discussed, questioning (i) the constancy of covariance matrices in favor of "stochastic volatility", (ii) Gaussian in favor of Levy increments, (iii) independence of increments in favor of Time Series and Hidden Markov models.
  • Focusing on non-constant variances, interesting issues involve estimation of variances, whether by regular sums of squares, GARCH, Bayesian models, Incomplete Data, or "implied" values.
  • There is strong evidence that incremental log prices are not only non-Gaussian, but rather outright heavy-tailed. Value at Risk assessment takes polynomial tail decay very seriously, but can we really price anything if money has infinite moments of all orders?
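
As a concrete reference point for the non-constant variances in the second issue, the standard GARCH(1,1) recursion updates the conditional variance as sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1]. The sketch below is only an illustration: the function name, the starting value, and the parameter values are assumptions made here, not material from the talk.

```python
def garch11_variances(returns, omega, alpha, beta):
    """Conditional variances from the standard GARCH(1,1) recursion.

    The recursion is started at the unconditional variance
    omega / (1 - alpha - beta), which requires alpha + beta < 1.
    """
    if omega <= 0 or alpha < 0 or beta < 0 or alpha + beta >= 1:
        raise ValueError("need omega > 0, alpha, beta >= 0 and alpha + beta < 1")
    sigma2 = [omega / (1.0 - alpha - beta)]
    for r in returns:
        # New variance = constant + reaction to last squared return + persistence.
        sigma2.append(omega + alpha * r * r + beta * sigma2[-1])
    return sigma2  # sigma2[t] is the variance forecast for period t


# Hypothetical daily log-returns and parameters, chosen only for illustration.
example_returns = [0.01, -0.03, 0.002, 0.015]
example_variances = garch11_variances(example_returns, omega=1e-6, alpha=0.1, beta=0.85)
```

A large return (here the second one) pushes the next variance forecast up, after which it decays back toward the unconditional level at a rate governed by alpha + beta.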


Session 1: Theory and Methodology
Chair: Micha Mandel, The Hebrew University

ORDANOVA: Analysis of Ordinal Variation and its Possible Applications

Tamar Gadrich, Emil Bashkansky, Ort Braude College

In order to accelerate object evaluation, some measurement systems commonly use an ordinal scale (e.g. sticks results, quality estimation). We will present a novel method to analyze ordinal data variation. As in classical ANOVA for continuous data, the ORDANOVA for ordinal data enables us to split the total variation into within and between components. This decomposition can be utilized for various practical applications such as classification, cluster analysis, distinguishing feature identification and so on.
Assume that any object is measured using an ordinal scale with K ordered categories. Since only comparisons of "equal/unequal" or "greater/less than" can be made between ordinal variable values, statistical measures of ordinal variables must be based on these limitations. Characterization of ordinal data can be done through location and dispersion measures. The focus of our presentation is on ORDANOVA (Ordinal data analysis of variation), i.e., the analysis of ordinal data variation. The analysis should include the choice of an appropriate ordinal dispersion measure, a study of the possibility of splitting the total ordinal dispersion into “within samples” and “between samples” components, determination of pertinent statistics for distinguishing between the latter, the relevant decision-making process, and unbiased estimators of every component.

Detecting novel bivariate associations in large data sets
Yakir Reshef

Identifying interesting relationships between pairs of variables in large data sets is increasingly important. One way of doing so is to search such data sets for pairs of variables that are closely associated. This can be done by calculating some measure of dependence for each pair, ranking the pairs by their scores, and examining the top-scoring pairs. We outline two heuristic properties–generality and equitability–that the statistic we use to measure dependence should have in order for such a strategy to be effective.
We then present a measure of dependence for two-variable relationships, the maximal information coefficient (MIC), that appears to have these properties. MIC captures a wide range of associations both functional and not (generality), and for functional relationships provides a score that roughly equals the coefficient of determination (R²) of the data relative to the regression function (equitability). Finally, we show that MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships.

Shewhart Revisited
Moshe Pollak, Hebrew University of Jerusalem

The Shewhart control chart was the first to monitor an ongoing process and raise an alarm when it appears that the level has changed. We show that the Shewhart chart is optimal for the criterion of maximizing the probability of detecting a change upon its occurrence, subject to a constraint on the ARL to false alarm.
In the multivariate setting, applying the Shewhart procedure to each process separately is suboptimal. We create a generalized Shewhart procedure that is optimal for the aforementioned criterion. The results are illustrated in common settings.
Joint work with Abba Krieger.
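
For readers less familiar with the quantities involved: the classical univariate Shewhart chart raises an alarm the first time an observation falls more than k standard deviations from the in-control mean. Assuming independent, normally distributed in-control observations (an assumption of this sketch, not a statement about the settings treated in the talk), the ARL to false alarm is 1/p, where p is the per-observation false-alarm probability. The function names are illustrative.

```python
from statistics import NormalDist


def shewhart_false_alarm_arl(k):
    """ARL to false alarm of a two-sided k-sigma Shewhart chart.

    Assumes i.i.d. normal in-control observations, so the run length is
    geometric with success probability p = 2 * (1 - Phi(k)).
    """
    p = 2.0 * (1.0 - NormalDist().cdf(k))
    return 1.0 / p


def first_alarm(observations, mean, sigma, k=3.0):
    """Index of the first observation outside mean +/- k*sigma, or None."""
    for i, x in enumerate(observations):
        if abs(x - mean) > k * sigma:
            return i
    return None
```

With the customary k = 3 this gives an ARL to false alarm of roughly 370 observations; widening the limits lengthens the ARL at the price of slower detection.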

Optimal adaptive designs to maximize power in clinical trials with multiple treatments
David Azriel and Paul Feigin, Technion
We consider a clinical trial with three competing treatments and study designs that allocate subjects sequentially in order to maximize the power of relevant tests. Two different criteria are considered: the first is to find the best treatment and the second is to order all three. The power converges to 1 at an exponential rate, and we use large deviations theory to find the optimal allocation that maximizes this rate. For the first criterion the optimal allocation has the plausible property that it assigns a small fraction of subjects to the inferior treatment. The optimal allocation depends on the unknown parameters, and therefore, in order to implement it, a sequential adaptive scheme is considered. At each stage of the trial the parameters are estimated and the next subject is allocated according to the estimated optimal allocation. We study the asymptotic properties of this design by large deviations theory and the small sample behavior by simulations.


Session 2: Official Statistics
Chair: Luisa Burck, Israel Central Bureau of Statistics (CBS)

Estimating the Mean Squared Error (MSE) of Seasonally Adjusted Series
Sharif Abu Gosh, CBS

The CBS maintains more than 500 monthly and quarterly time series that undergo seasonal adjustment, and estimates of the seasonally adjusted series and of the trend are published for them on a regular basis. X-12-ARIMA (hereafter X12) is the standard seasonal adjustment method at the CBS and at many statistical offices around the world. X12 is an extension, developed over the years, of the X11 method. The method is nonparametric and is based on moving averages, symmetric in the middle of the series and asymmetric at its ends, for estimating the components of the time series.
X12, however, has a notable drawback: it does not produce estimates of the mean squared error (MSE) of the time-series components it estimates. Estimating the MSE therefore requires supplementing X12 with a separate method. The MSE is important both for assessing the accuracy of the component estimators obtained from X12 and for statistical inference about the components of the series. Estimation of the MSE and of the variance of the X12 component estimators has been studied worldwide, and several estimation methods have been proposed, some parametric and some nonparametric.
The aim of this work is to examine two methods, one parametric (the Parametric Bootstrap method) and one semi-parametric (the Semi-Parametric method), for estimating the MSE of seasonally adjusted series obtained from X12.
In this study we use simulated series, generated from known ARIMA models, to examine the reliability of each of the two methods, and real CBS series to apply the two methods.
The simulation results show that the shape of the MSE estimate under both methods resembles that of the empirical MSE: a symmetric structure, with constant values in the middle of the series and values that begin to fluctuate and grow toward the ends. The Parametric Bootstrap estimate coincides with the empirical MSE, whereas the Semi-Parametric estimate is higher. Computing the MSE for the real series yields results similar to those of the simulation.


Academic Achievements of Bachelor's Degree Recipients in Israel
David Maagan, Miri Dadash-Alon, CBS

Constructing Administrative Families from the Population Register
Theodor Itskov, Alexandra Katzenelenbogen, CBS

To characterize the state of society, the CBS, like other statistical offices around the world, uses both survey data and administrative databases. Administrative sources have advantages and disadvantages compared with survey sources. The main advantages are easy access to the data and coverage of the entire population, or at least most of it. The disadvantages include update delays and missing or incorrect information in certain variables (a drawback that also characterizes survey data, for certain variables). The CBS Integrated Census created a need for an administrative object that approximates dwelling households in the population. To this end we constructed an object called an administrative family.
An administrative family is defined as one person, or a group of people, who according to administrative information (the Population Register) live together and are officially related by family ties. To construct administrative families we use not only address variables, which are not always up to date and are sometimes missing in the Population Register, but also other, more complete and reliable variables: family relationships, age, religion, sex, and marital status. The proposed algorithms and procedures include variables and parameters that make it possible to construct administrative families of different types.
We tested the quality of the algorithms on data from the 2008 Integrated Census, which included about 15% of the country's population. We examined the main assumptions of the nuclear-family model. We compared administrative families with the dwelling households enumerated in the census. A household and an administrative family were considered equal only when all of their members were identical. We computed estimates: about 64.2% of the administrative families are equal to households, and about 74.8% of the households are equal to administrative families.
Administrative families have found use not only in the Integrated Census but also in sampling, in locating addresses of family members, in record linkage in other surveys, and in producing the Society in Israel reports. In addition to the administrative families that approximate households, we constructed, for various purposes, other administrative families under constraints. These families are: "by address", in which all family members are registered at the same address; "nuclear", which consist of first-degree relatives and do not depend on the registered address; and "statistical-area preserving", in which the family members are registered in the same statistical area, when it is known.


A Time to Give Birth and a Time to Live: The Relationship Between the Mother's Age at the Birth of Her Last Child and Longevity
Orly Manor, The Hebrew University


Session 3: Biostatistics and Bioinformatics
Chair: Malka Gorfine, Technion

A consistent multivariate test of association based on ranks of distances
Ruth Heller, Tel-Aviv University

We are concerned with the problem of detecting whether an association of any kind exists between random vectors of any dimension. Few tests of independence exist to date that are consistent against all dependent alternatives. We propose a powerful test that is applicable in all dimensions, is robust to outliers, and is consistent against all alternatives. The test has a simple form and is easy to implement. We demonstrate its good power properties in simulations. The test can serve as a valuable tool to identify pairs of genes or gene sets that have a non-monotone relationship.
Joint work with Yair Heller and Malka Gorfine

On the Robustness of the Adaptive LASSO to Model Misspecification
Yair Goldberg, Haifa University

Penalization methods, such as adaptive LASSO, have been suggested in the context of variable selection. These methods were found useful not only for linear regression but also for generalized linear models (GLM) and Cox regression. When the model is correctly specified, it was shown that these estimators yield both consistent variable selection and oracle parameter estimation. What happens when the model is misspecified? Does the adaptive LASSO converge? Can it consistently select the right variables? In this talk we will try to answer these questions.

Probabilistic Analysis of Saturation Mutagenesis
Yuval Nov, Haifa University

Saturation Mutagenesis is a protein engineering technique whereby a few positions along a protein are identified as likely to accommodate beneficial mutations, and are then mutated randomly. The resulting protein variants are collectively called a "library," and are screened in the hope of discovering among them a highly active protein variant.
We propose a family of new criteria for determining the library size. When the number of all possible distinct variants is large, any of the top-performing variants (e.g., any of the top three) is likely to meet the design requirements, so the probability that the library contains at least one of them is a sensible criterion for determining the library size. By using a criterion of this type, one may significantly reduce the library size and thus save costs and labor while minimally compromising the quality of the best variant discovered. We present the probabilistic tools underlying the new criteria, and show that the existing criteria are needlessly conservative and wasteful, sometimes by orders of magnitude.
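
A criterion of this type can be sketched under a deliberately simplified model, which is an assumption of this example rather than the model of the talk: each of the L library members is drawn independently and uniformly from N distinct variants. The probability that the library contains at least one of the top k variants is then 1 - (1 - k/N)**L. The function names are illustrative.

```python
import math


def prob_contains_top(n_variants, k_top, library_size):
    """P(library contains at least one of the k_top best variants).

    Assumes independent, uniform draws over n_variants distinct variants.
    """
    return 1.0 - (1.0 - k_top / n_variants) ** library_size


def library_size_for(n_variants, k_top, target_prob):
    """Smallest library size whose success probability reaches target_prob."""
    if not (0 < k_top < n_variants and 0.0 < target_prob < 1.0):
        raise ValueError("need 0 < k_top < n_variants and 0 < target_prob < 1")
    miss = 1.0 - k_top / n_variants  # chance that one draw misses all top variants
    return math.ceil(math.log(1.0 - target_prob) / math.log(miss))


# Hypothetical example: three mutated positions, 20**3 = 8000 amino-acid variants.
size_top1 = library_size_for(8000, 1, 0.95)
size_top3 = library_size_for(8000, 3, 0.95)
```

Under this model, accepting any of the top three variants instead of insisting on the single best one cuts the required library size to roughly a third. Real libraries are generated at the codon level, where variants are not equally likely, so the figures are only indicative.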


Accurate Estimation of Heritability in Genome Wide Studies using Random Effects Models
David Golan and Saharon Rosset, Tel Aviv University

Motivation: Random effects models have recently been introduced as an approach for analyzing genome wide association studies (GWAS), which allows estimation of overall heritability of traits without explicitly identifying the genetic loci responsible. Using this approach, Yang et al. (2010) have demonstrated that the heritability of height is much higher than the ∼10% associated with identified genetic factors. However, Yang et al. (2010) relied on a heuristic for performing estimation in this model.
Results: We adopt the model framework of Yang et al. (2010) and develop a method for maximum likelihood (ML) estimation in this framework. Our method is based on MCEM (Wei et al., 1990), an expectation-maximization algorithm wherein a Markov chain Monte Carlo approach is used in the E-step. We demonstrate that this method leads to more stable and accurate heritability estimation compared to the approach of Yang et al. (2010), and it also allows us to find ML estimates of the portion of markers which are causal, indicating whether the heritability stems from a small number of powerful genetic factors or a large number of less powerful ones.


Session 4: Industrial Statistics
Chair: Inbal Yahav, Bar-Ilan University

Is Oprah Contagious? Identifying Demand Spillovers in Online Networks
Gal Oestreicher-Singer, Tel Aviv University

We study the spread of exogenous demand shocks generated by book reviews featured on the Oprah Winfrey TV show and published in the New York Times through the online co-purchase recommendation network on Amazon.com. We analyze the co-purchase recommendation network on Amazon.com to determine how such exogenous events might affect the demand for books that were not explicitly mentioned in a review but are located “close” to reviewed books in the network. Using a difference-in-differences matched-sample approach, we identify the extent of variation caused by membership in this network. Our results show that the demand shock diffuses to books that are up to three links away from the reviewed book, and that this diffused shock persists for a substantial number of days. However, the depth and the magnitude of diffusion vary widely across books at the same network distance from reviewed products. We also describe how product characteristics, associative mixing and local network properties can explain this variation in the depth and persistence of contagion. Specifically, highly clustered local networks “trap” the diffused demand shocks, causing them to last longer and be more pronounced while restricting the distance of the shocks’ spread. Conversely, less clustered networks lead to wider contagions of lower magnitude and duration. We discuss the significance of these results and their implications for the design of networks of products as well as for optimizing digital marketing spillovers.
Joint work with Arun Sundararajan and Eyal Carmi.
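
The difference-in-differences idea used in this abstract can be illustrated in its most basic form: the change in demand for "treated" books (those near a reviewed book) is compared with the change over the same period for matched control books. This is only a sketch of the textbook estimator with made-up numbers, not the authors' matched-sample analysis; the function name is illustrative.

```python
from statistics import mean


def diff_in_diff(treated_before, treated_after, control_before, control_after):
    """Basic difference-in-differences estimate from four groups of outcomes."""
    treated_change = mean(treated_after) - mean(treated_before)
    control_change = mean(control_after) - mean(control_before)
    # The control change stands in for what would have happened without the shock.
    return treated_change - control_change


# Hypothetical daily sales figures, for illustration only.
effect = diff_in_diff(
    treated_before=[10, 12],
    treated_after=[15, 17],
    control_before=[9, 11],
    control_after=[10, 12],
)
```

Here the treated books gain 5 units on average while the controls gain 1, so the estimated spillover is 4 units.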


Prediction in Economic Networks: Using the Implicit Gestalt in Product Graphs
Tomer Geva, Google

We define an economic network as a linked set of entities, where links are created by actual realizations of shared economic outcomes between entities. We analyze the predictive information contained in a specific type of economic network, namely, a product network, where the products are offered on a website and links designate pairs that were purchased simultaneously. Such Web-based product networks are becoming increasingly prevalent. Our data set covers a diverse set of 1 million books spanning over 400 categories over a period of two years with a total of over 70 million observations. Using autoregressive and neural network models, we demonstrate that an entity’s future demand is more accurately predicted by combining its historical demand with that of its neighbors than by considering its demand alone. In addition, network properties such as local clustering and centrality contribute significantly to the predictive accuracy of the neural network. To our knowledge, this is the first large-scale study showing that an economic network contains useful distributed information for demand prediction, and that this information is more effectively exploited by integrating composite structural network properties like PageRank explicitly into one’s predictive models.
Joint work with Vasant Dhar, Gal Oestreicher-Singer, Arun Sundararajan


Predicting Participation in Online Forums by Analyzing User Participation Patterns
Liron Sivan, Tel Aviv University

Online communities are a widespread phenomenon that has reshaped the way we communicate and carries implications for business, politics and society. An online community is a Web 2.0 platform that enables participants to communicate among themselves. As such, the success of a community depends less on information provided by the firm and more on the ability of its user base to generate content, respond to content supplied by others, and generally contribute to the quality and liveliness of the website. For the platform-providing firm, participation drives revenues, for example via advertising income. Thus, a fundamental concern is to understand the current and expected patterns of participation among users. The ability to predict participants’ loyalty to the online community and to identify the factors that affect it is essential in order to assess the health of the community and its business-related value.
In this research, we offer a novel approach for estimating the number of active users of an online community and predicting future participation of the community members, using a unique Israeli data set. We demonstrate how probability models (specifically the geometric/beta-Bernoulli model) traditionally used for customer loyalty analysis in the marketing literature can be successfully used to assess participation patterns, and thus the future value of online communities. We analyze data taken from multiple online forums on a major online site, and explore the factors that affect the usefulness of this approach. Our results suggest that, compared with current methods, our approach generates better estimations of future visits as well as future contribution levels.
Joint work with Gal Oestreicher-Singer, Barak Libai

Can the National "Mood" Predict Rises in Unemployment?
Boaz Arad, Maya Computers

We are in the midst of a digital information revolution, driven by the flooding of the web and of databases with enormous quantities of data. It is a revolution based on every person's access to information and on the empowerment of the individual, who becomes a producer of information that is distributed without restriction. This revolution has many implications for scientific research, for business applications, and for political moves at the local and global level. The UN recently launched the "Global Pulse" project, in which the business analytics company SAS is taking part, to examine the ability to predict economic crises and unemployment by monitoring social networks. In Israel, SAS carried out a pilot that examined attitudes toward Israel by monitoring Twitter tweets.
These projects, the tools that make them possible, and their potential for research and for implementing business and public policy will be discussed in the talk by Boaz Arad, policy analyst and marketing communications consultant at Maya Computers Ltd.


Session 5: Talks from the Department of Statistics at Tel Aviv University
Chair: Saharon Rosset, Tel Aviv University

Bayesian FDR controlling procedures
Daniel Yekutieli, Tel Aviv University

I will explain the relation between Bayesian and frequentist control over the FDR. I will discuss the advantages of using Bayesian FDR controlling procedures instead of the Benjamini-Hochberg procedure and present joint work with Ruth Heller: eBayes FDR controlling procedures for discovering replicability in GWAS that are considerably more powerful than the BH procedure.
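
For reference, the Benjamini-Hochberg (BH) step-up procedure that serves as the baseline here can be sketched as follows. This is the standard frequentist procedure, not the Bayesian procedures of the talk; the function name is illustrative.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure.

    Sort the m p-values, find the largest rank k with p_(k) <= k * q / m,
    and reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            n_reject = rank  # keep the largest rank passing its threshold
    return sorted(order[:n_reject])
```

Because the procedure is step-up, a p-value that fails its own threshold is still rejected whenever a larger p-value passes a later one.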

The “Less Fitting, More Optimism” Paradox in Model Selection
Shachar Kaufman and Saharon Rosset, Tel Aviv University
The difference between a modeling approach's expected in-sample and out-of-sample performance, termed optimism by Efron (1983), is an important and widely used concept in statistical modeling and model selection. It is a sensible view that this measure of the effect of "overfitting" is also a measure of how much "fitting" was performed. This notion is conveyed, for example, through the use of the concept of optimism to quantify the effective degrees of freedom, as in Hastie et al. (2009). If so, nested modeling approaches, which are clearly ordered in terms of fitting, should be similarly ordered in terms of optimism, as is the case for example for linear regression models. We show that this is not the case in general, including for widely used approaches like ridge regression (in its constrained form). It turns out that "more fitting, less optimism" is a common phenomenon as one moves away from the most standard cases. We then discuss sufficient conditions for monotonicity between fitting and optimism in nested models. We prove that this is guaranteed for two important classes of modeling approaches: First, for all symmetric linear smoothers; and second, for successive projection scenarios where the larger model is a linear regression model and the smaller model is a projection to a convex subset of the linear space (as in constrained ridge regression or LASSO).


Revisiting The Statistical Analysis of the Israeli-Palestinian Conflict
Jonathan Rosenblatt and David Golan, Tel Aviv University

Using casualty counts since September 2000, Jaeger & Paserman adopt a data driven approach to address the question "whether violence against Israelis and Palestinians affects the incidence and intensity of each side's reaction" [Jaeger, Paserman 2008]. Using B'Tselem data, they use an empirical impulse-response estimate to conclude:

"…the Israelis react in a predictable and statistically significant way to Palestinian violence against them while Palestinian actions are not related to Israeli violence, either through revenge or deterrence. Our results suggest that a cessation of Palestinian violence against Israel may eventually lead to an overall cessation of violence."

Two years later, using the same data augmented with Qassam firing data, Haushofer, Biletzki and Kanwisher redo a similar analysis [Haushofer et al. 2010], concluding:

"… unlike prior studies, we show that Palestinian violence also shows a retaliatory pattern: (i) the firing of Qassam rockets increases sharply after Israelis kill Palestinians, and (ii) the probability (although not the number) of killings of Israelis by Palestinians increases after killings of Palestinians by Israel."

With great unease and discomfort, we adopted the data driven approach to this multifaceted conflict and tried to correct some of the flaws in the original analysis [Golan, Rosenblatt 2011]. Indeed, some of the findings in [Haushofer et al. 2010] vanish, if not reverse. Most notably, we find that the statistical regularity of the conflict varies over time. In most periods, retaliation can explain only a minuscule portion of events. Haushofer et al. replied to this critique by re-analyzing their data while adopting some of our suggestions and rejecting others [Haushofer et al. 2011]. Their original results and conclusions hold under their suggested re-analysis. In this talk I present the mentioned papers and our critique, and discuss the authors' latest reply, concluding that there are questions better not addressed with a quantitative approach.


Selective confidence intervals
Yoav Benjamini, Tel Aviv University

Selective inference involves drawing inferences on a subset of the parameters that has been selected based on the same data used for the inference.
I shall review some examples, making the point that it is both common and unavoidable in modern scientific research. I shall discuss the False non-Coverage Rate as an example of addressing the problem by “assessing performance on the average over the selected”, and present new confidence intervals for the selected large parameters that enjoy high power to determine the signs of the selected parameters.
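
One established way to control the False non-Coverage Rate, the FCR-adjusted construction of Benjamini and Yekutieli, builds each selected interval at the stricter marginal level 1 - R*q/m, where R of the m parameters were selected. The sketch below assumes independent normal estimates with known standard errors and is not the new intervals announced in the talk; the function name and the data are hypothetical.

```python
from statistics import NormalDist


def fcr_adjusted_intervals(estimates, std_errors, selected, q=0.05):
    """FCR-adjusted normal intervals for the selected parameters.

    Each selected parameter gets a marginal (1 - R*q/m) interval, where
    m = len(estimates) and R = len(selected).
    """
    m, r = len(estimates), len(selected)
    if r == 0:
        return {}
    z = NormalDist().inv_cdf(1.0 - r * q / (2.0 * m))
    return {
        i: (estimates[i] - z * std_errors[i], estimates[i] + z * std_errors[i])
        for i in selected
    }


# Hypothetical estimates; the two largest in absolute value are selected.
intervals = fcr_adjusted_intervals([3.1, 0.2, -2.9, 0.5], [1.0] * 4, selected=[0, 2])
```

The fewer parameters are selected out of the m examined, the wider each reported interval becomes; when everything is selected the construction reduces to ordinary 1 - q intervals.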


Session 6: Contributed Talks
Chair: to be determined

Sojourn Time Estimation in an M/G/∞ Queue with Partial Information
Nafna Nelgabats, Haifa University

We propose an estimator for the CDF G of the sojourn time in an M/G/∞ queueing system, when the available data consists of the arrival and departure epochs alone, without knowing which arrival corresponds to which departure. The estimator is a generalization of an estimator proposed by Brown (1970), and is based on a functional relationship between G and the distribution of the time between a departure and the jth latest arrival preceding it. The estimator is shown to outperform Brown’s estimator, especially when the system is heavily loaded.
* Study done in cooperation with Prof. Gideon Weiss and Dr. Yuval Nov

High throughput genome-wide scan for epistasis with implementation to Recombinant Inbred Lines (RIL) populations.
Pavel Goldstein, Anat Reiner-Benaim, Abraham Korol, University of Haifa

Objectives: Expression QTL epistasis is particularly hard to track since the number of genes for which expression is measured can reach to tens of thousands. Even if only second-order interactions are searched for, the total number of combinations of loci-pairs and traits to be tested is very large, substantially reducing the power to detect any effects. This work proposes an integrative strategy to confront the challenges of statistical error, dimension reduction and epistasis modeling using a two-stage procedure for identifying eQTL epistasis.
Methods: The proposed algorithm first constructs multi-trait complexes, which are one-dimensional representations of groups of genes. Then, epistasis is tested for among all combinations of complexes and loci-pairs: a hierarchical FDR controlling procedure is employed, starting with a "rough" search for pairs among relatively distant loci, followed by a higher resolution search only within the identified regions. Epistasis is tested for by fitting a NOIA model, which links the statistical and functional epistasis interpretations and provides estimates of orthogonal effects. The algorithm is evaluated in terms of heritability, testing power and FDR control using simulations, and is applied to an Arabidopsis RIL genome.
Results: Aside from considerably reducing the number of tests, the multi-gene complexes increase both effect heritability and detection power. Applying the algorithm to the Arabidopsis data reduced the number of tests by a factor of 2575 relative to testing all possible combinations. In total, 1052 effects were identified while controlling the full hierarchical tree FDR at level 0.1. Of the interactions identified by the NOIA model, many did not include main effects, suggesting that epistasis is not conditioned on single-locus effects.
Conclusions: The proposed hierarchical search helps in identifying expression epistasis since it substantially increases power while controlling the expected proportion of false positives. A further increase in heritability and power is achieved by using groups rather than single genes.

Model With Acceptance Threshold in Social Networks
Alon Sela, Tel Aviv University

We built a model describing the diffusion of ideas through human networks. We assume that if the majority of a person's "friends" accept a new idea, the chances that the person accepts it grow. Initial results from the model point toward a basic structure that is beneficial for spreading information through social groups under different conditions. In projects dealing with cross-organizational changes or political campaigns, a semi-closed loop within the social network, in which information flows permanently, is a beneficial structure for project success. While keeping at least some connection to other nodes in the network, this loop spreads "bits" of information (messages) through the rest of the network. As the messages circulate through this inner loop, the acceptance threshold of the nodes outside the loop is slowly approached until, at a certain time, an "explosion" of information suddenly occurs and the new information becomes a norm spread by all. Our model thus demonstrates the seemingly "epidemic" behavior of ideas, a phenomenon that seems to start suddenly.
Although the nodes with the higher rank (number of connections) are the most efficient at creating information percolation, these central nodes are usually more expensive when costs are considered. In those cases, new ideas can be assimilated at lower cost if they are injected at nodes around the central hubs, or through strategies of gaining power over cheap nodes near the hubs. Correct timing of the injection also helps achieve diffusion of ideas at lower cost. Insights from the model are relevant to the spreading of new ideas, viral marketing, the influence of news on financial markets, cost-effective immunology plans, and complex system planning for greater robustness against cascading failures, by protecting nodes against the spread of failures through a failure cascade.
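
The acceptance-threshold mechanism can be sketched in a few lines. This is a simplified, deterministic variant and not the author's model: a node adopts as soon as strictly more than a given fraction of its neighbors have adopted, whereas the abstract describes acceptance chances that grow. The function `spread`, the toy graph `neighbors`, and the synchronous update rule are assumptions.

```python
def spread(neighbors, seeds, threshold=0.5, max_rounds=100):
    """Synchronous threshold diffusion on an undirected graph.

    neighbors maps each node to a list of its neighbors. Returns the final
    set of adopters and the number of adopters after each round.
    """
    adopted = set(seeds)
    history = [len(adopted)]
    for _ in range(max_rounds):
        new = {
            node
            for node, nbrs in neighbors.items()
            if node not in adopted
            and nbrs
            and sum(n in adopted for n in nbrs) / len(nbrs) > threshold
        }
        if not new:
            break  # no node crossed its threshold: the process has stopped
        adopted |= new
        history.append(len(adopted))
    return adopted, history


# Toy graph (hypothetical): a tightly connected group {0, 1, 2, 3}
# linked through node 4 to a peripheral node 5.
neighbors = {
    0: [1, 2, 3],
    1: [0, 2, 3],
    2: [0, 1, 4],
    3: [0, 1, 4],
    4: [2, 3, 5],
    5: [4],
}
```

Seeding the idea inside the dense group, `spread(neighbors, {0, 1, 2})`, reaches every node within three rounds, while seeding only the peripheral node, `spread(neighbors, {5})`, never spreads. This mirrors the abstract's point that where the idea is injected matters.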


Variable selection by combinatorial optimization algorithms, with application to pharmacogenomics
Joseph Levy, Teva Pharmaceutical Industries

Given a set of variables X₁,…,Xₙ that are associated with a response variable Y, it is often desired to select a subset of X₁,…,Xₙ that will enable predicting the value of Y.
Traditional approaches to the problem are based on measures of the association of each of the observed candidate predictors with the observed response. Such methods can be performed manually or automatically (as in stepwise regression and its variants, and other methods).
I propose to approach the variable selection problem as a combinatorial optimization problem. Application of heuristics commonly used for such problems provides superior results. The approach will be demonstrated using data from one of Teva's recent pharmacogenomics studies.
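
The abstract does not say which heuristics or objective are used, so the following is only a sketch of the general idea under stated assumptions: subsets are scored by the BIC of an ordinary least squares fit, and the search is a random-restart local search that flips one variable at a time. All function names are hypothetical, and the code assumes non-collinear predictors and a nonzero residual.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def bic(rows, y, subset):
    """BIC of a least squares fit of y on an intercept plus the chosen columns."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[k] for d in design) for k in range(p)] for i in range(p)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(p)]
    beta = solve(xtx, xty)  # normal equations
    rss = sum((yi - sum(b * v for b, v in zip(beta, d))) ** 2
              for d, yi in zip(design, y))
    n = len(y)
    return n * math.log(rss / n) + p * math.log(n)


def local_search(rows, y, seed=0, restarts=5):
    """Random-restart local search over subsets, minimizing BIC."""
    rng = random.Random(seed)
    n_vars = len(rows[0])
    best, best_score = None, math.inf
    for _ in range(restarts):
        current = {j for j in range(n_vars) if rng.random() < 0.5}
        score = bic(rows, y, sorted(current))
        improved = True
        while improved:
            improved = False
            for j in range(n_vars):
                candidate = current ^ {j}  # add j if absent, drop it if present
                cand_score = bic(rows, y, sorted(candidate))
                if cand_score < score:
                    current, score, improved = candidate, cand_score, True
        if score < best_score:
            best, best_score = sorted(current), score
    return best, best_score
```

Unlike stepwise regression, which follows a single greedy path, the restarts let the search begin from several random subsets and keep the best local optimum found. Other heuristics such as simulated annealing or tabu search would fit the same objective.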
