Foreign-origin inventors in the USA: Testing for Diaspora and Brain Gain Effects

We assess the role of ethnic ties in the diffusion of technical knowledge using a database of patents filed by US-resident inventors of foreign origin, identified by name analysis. We consider 10 leading source countries, both Asian and European, of highly skilled migration to the USA and test whether foreign inventors' patents are disproportionately cited by (i) co-ethnic migrants ('diaspora' effect), and (ii) inventors residing in their country of origin ('brain gain' effect). We find evidence of the diaspora effect for the Asian but not the European countries, with the exception of Russia. A diaspora effect does not necessarily translate into a brain gain effect, most notably for India where no such effect is detected. Neither does a brain gain effect occur solely in conjunction with a diaspora effect. Overall, diaspora and brain gain effects carry less weight than other channels of knowledge transmission, most notably co-invention networks and multinational companies.

Acknowledgements: Unique identifiers for inventors in the EP-INV database come from the APE-INV project (Academic Patenting in Europe), funded by the European Science Foundation. The pilot project for assigning inventors to specific countries of origin was funded by the World Intellectual Property Organization (WIPO), which also made available to us the WIPO-PCT dataset. We received useful suggestions from participants at the following conferences: MEIDE (Santiago de Chile, November 2013), PATSTAT (Rio de Janeiro, November 2013), EUROLIO (Utrecht, January 2014), EPIP (Brussels, September 2014), AAG (Chicago, April 2015) and “Migration & Development” (Washington, May 2015); as well as from participants in seminars at University College Dublin, London School of Economics, CRIOS-Bocconi, Kassel University, Collegio Carlo Alberto (Turin), IMT (Lucca), LUISS (Rome), GREThA-Bordeaux, UC Davis and UC Berkeley. Gianluca Tarasconi contributed decisively to the creation of the Ethnic-Inv dataset. Diego Useche provided valuable research assistance. We owe the tip on the IBM-GNR system to Lars Bo Jeppesen, while Curt Baginski assisted us in its implementation. Lissoni and Miguelez acknowledge financial support from the Regional Council of Aquitaine (Chaire d’Accueil programme and PROXIMO project).


Introduction
Recent research on the international mobility of scientists and engineers has seen the convergence of two streams in the literature. First, studies of the geography of innovation have explored the role of social ties in facilitating knowledge diffusion and in determining its spatial reach, including, in the case of migrant and foreign-origin inventors, the ties established with other members of their ethnic community (Agrawal et al. 2008). Second, migration and development studies have explored the extent to which highly skilled migrants contribute to innovation in their home countries through international knowledge flows (Kapur 2001, Kerr 2008, Agrawal et al. 2011), foreign direct investment (Foley and Kerr 2013), and entrepreneurial returnee migration (Nanda and Khanna 2010, Saxenian 2006).
While this convergence in the literature has enabled major advances, there is a clear need to consolidate the research field. We do not know, for example, the extent to which the social ties between migrants in the countries of destination extend to their countries of origin and thus contribute to international knowledge transfer. We also have little understanding of the differences across migration corridors. Here, existing studies focus primarily on what, in recent years, have been the fastest growing corridors (from China and India to the US) and tend to overlook the importance of Europe, not just as a region of destination, but also of origin. As of 2010/11, the top 10 contributors to the stock of highly educated migrants to OECD countries included the UK, Germany, and Poland. With over 3.6 million people, the combined stock of these three countries was 60% higher than that of India (top of the ranking) and more than twice that of China (third in the ranking, after the Philippines) (OECD 2015). According to the same source, Germany, Italy, and France have greater highly skilled emigration rates than China and India (between 6 and 9% compared to less than 5%), while the rates for the UK, Poland, and Romania stand at 11, 17, and 21%, respectively (Mayr and Peri 2008, Schiff and Wang 2013).
Finally, official statistics do not provide details on the specific skills or jobs of the highly educated. Hence, the gathering of micro-evidence at the cross-country level is essential.
We contribute to this emerging literature by producing and analysing an extensive dataset of foreign-origin, US-based inventors from five Asian (China, India, Iran, Japan, and South Korea) and five European countries (France, Germany, Italy, Poland, and Russia). All data are novel and come from EP-INV, a database of uniquely identified inventors listed on patent applications at the European Patent Office (EPO), combined with name analysis based upon IBM-GNR©, a commercial database. Foreign-origin inventors include both foreign nationals and first or subsequent generation migrants who have acquired US nationality but may still contribute to knowledge diffusion, based on ethnic affiliation.
We analyse the knowledge flows generated by these inventors, as measured by forward citations to their patent applications. We state a "diaspora effect" to exist when US-resident inventors of the same foreign origin have a higher propensity to cite one another's patents, compared to patents by other inventors, other things being equal. We state a "brain gain effect" to exist when US-resident foreign-origin inventors are disproportionately cited by inventors in their home countries, so that the latter stand to gain from highly skilled migration. We find evidence of a diaspora effect for Asian inventors, but not for their European counterparts, with the exception of Russian and (to a much lesser extent) German inventors. In general, ethnic ties appear to act as substitutes for co-location at the city level and for proximity in the social network of inventors. Their marginal effect does not appear to be as large as those of city-level co-location and short social distance.
In the cases of China, India and Russia, the diaspora effect presents an international dimension, as migrant inventors in countries other than the US also enjoy privileged access to knowledge produced by co-ethnic, US-based inventors. This, in turn, translates into a brain gain for China and Russia, though not for India. South Korea also presents a brain gain effect from its US-resident inventors. In the case of the advanced countries of France, Italy, and Japan, the brain gain effect is channelled through multinational enterprises. We detect no brain gain effect for Germany.
In what follows, we first survey the literature on migration and knowledge flows, with special emphasis on patent-based studies (section 2). We then present our research questions and data (section 3) and our results for the diaspora (section 4) and brain gain effects (section 5). Section 6 concludes. Appendixes discussing methodological issues and presenting robustness checks are available as additional online material.

Background literature

Localized knowledge flows and the role of social ties
Localized knowledge flows are a central concept in the geography of innovation (Breschi 2011). In the form of pure externalities, they are present in both Marshallian and Jacobian location theories (Henderson 1997, Ellison et al. 2010). Yet, their importance has been questioned both by New Economic Geography models (Krugman, 1991) and by evolutionary location theories (Boschma and Frenken 2011). A key point of contention in this debate has been that of their measurement, fraught with technical and conceptual difficulties.
These difficulties were first addressed by Jaffe et al. (1993), who introduced the use of patent citations along with a simple, yet influential, methodology (from now on, the JTH method). The method makes use of two sets of patent pairs. The first includes a sample of cited patents and all their corresponding citing patents, excluding self-citations at the company level (cited-citing or "case" pairs); the second includes the same sample of cited patents, but here the citing ones are replaced by controls that have the same technological classification and priority year (cited-control or "control" pairs). After geo-localising patents at the city, state, or country level, a simple test of proportions is conducted to demonstrate that the proportion of co-localized cases is significantly higher than that of co-localized controls. The method can be generalized by means of regression analysis, with the probability of a citation occurring as the dependent variable, and the stacked sets of cited-citing and cited-control patent pairs as observations (Singh and Marx 2013). Subsequent technical refinements of the JTH method have involved the level of detail chosen for the technological classification of patents (Thompson and Fox-Kean, 2005; Henderson et al., 2005) and the origin of patent citations (applicant vs. examiner; for a critical discussion of this approach, see Alcacer and Gittelman, 2006; Breschi and Lissoni, 2005b).

Later studies have modified the JTH method as they seek to identify the actual mechanisms underpinning localized knowledge flows and their economic characteristics. Breschi and Lissoni (2005a, 2009) show that a large proportion of localized patent citations are self-citations at the individual level, associated with inventors that move between or consult across firms in the same location or region. Other localized citations occur between individuals located at short geodesic distances in co-inventorship networks. Agrawal et al. (2006) show that social ties of this kind, once established locally, may be resistant to physical distance.

[...] patents are of lower quality than those produced at home by the same multinationals, the quality gap is narrowing in the case of China, which is indicative of effective knowledge transfer. The same does not apply to India.
As far as returnees' direct contributions are concerned, Agrawal et al. (2011) and Alnuaimi et al. (2012), based on studies of Indian inventors, manage to identify only a handful of returnees, suggesting that in the case of India, at least, these are not a massive source of knowledge transfer. Choudhury (2016), drawing on data from the Indian R&D facilities of one US multinational, finds that the most inventive employees are those working under returnee managers. This may be indicative of the latter's role as knowledge brokers between headquarters and subsidiaries.

Data issues
The increasing availability of inventor data has led several scholars to improve the quality and transparency of their data mining efforts. A key issue here is that of name disambiguation, that is, assigning a unique ID to inventors whose name or address might be reported differently on several patent documents (Marx et al. 2009, Raffo and Lhuillery 2009, Martínez et al. 2013, Li et al. 2014, Pezzoni et al. 2014, Ge et al. 2016). This has important implications for migration studies (more details in Appendix 1).
Ideally, a good disambiguation algorithm should minimize both false negatives (maximize "recall") and false positives (maximize "precision"). In practice, a trade-off exists, with high recall being much harder to achieve than high precision. High precision/low recall algorithms underestimate the number of personal self-citations and overestimate co-ethnic citations, as one self-citing ethnic inventor might be mistaken for two co-ethnic inventors citing one another. This latter bias can vary according to the inventors' country of origin, as disambiguation algorithms are language-sensitive.
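To make the trade-off concrete, the following is a minimal sketch of how pairwise precision and recall could be computed for a disambiguation result. The function names and the toy labels are illustrative assumptions, not part of the EP-INV algorithm (which is described in Appendix 1). The example mimics a high precision/low recall outcome, in which one inventor's records are split across two identifiers.

```python
from itertools import combinations


def linked_pairs(labels):
    """Return the set of record-index pairs that share the same label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_precision_recall(true_ids, predicted_ids):
    """Pairwise precision and recall of a disambiguation result.

    true_ids[k] is the real person behind record k; predicted_ids[k] is the
    identifier assigned by the (hypothetical) disambiguation algorithm.
    """
    true_pairs = linked_pairs(true_ids)
    predicted_pairs = linked_pairs(predicted_ids)
    correct = len(true_pairs & predicted_pairs)
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Toy data (invented): inventor "A" signs three patents, inventor "B" two.
# The algorithm splits A's records into identifiers 1 and 2.
true_ids = ["A", "A", "A", "B", "B"]
predicted_ids = [1, 1, 2, 3, 3]
precision, recall = pairwise_precision_recall(true_ids, predicted_ids)
print(precision, recall)
```

In this toy case every predicted link is correct, but half of the true links are missed. A citation from record 2 to record 0 would then be counted as a citation between two distinct co-ethnic inventors rather than as a personal self-citation.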
To date, patent-based studies of migration and innovation have ignored these issues. Kerr (2008) and extensions employ non-disambiguated data; Agrawal et al. (2008, 2011) and Almeida et al. (2015) provide no details on disambiguation; and Alnuaimi et al. (2012) resort to "perfect matching", which functions as an extreme high precision/low recall algorithm.
Issues of precision and recall also emerge when assigning inventors to a country of origin or ethnic group based on names/surnames. Agrawal et al. (2008), for example, identify Indian inventors using a very narrow list of Indian surnames, considered as being both highly frequent in India and indicative of recent migration status. This, however, tends to limit attention to first-generation migrants and to assume that the strength of ethnic ties weakens with time. While this might be true, the assumption is not precise about the generational timing of this decay and it ignores the possibility of "ethnic revival" and "reverse brain drain" effects (Kuznetsov 2010, Kuznetsov 2006, Zweig 2006). Information on inventors' nationality, as used by Miguelez (2016), is an extremely practical substitute for name analysis, but also constitutes a low recall algorithm (established migrants that acquire the host country's nationality turn out as false negatives).
Technical concerns also arise with patent applicants. All studies claim to control for company self-citations; yet they remain silent on the methodologies adopted to identify companies and business groups. This contrasts with recent harmonization efforts (Peeters et al. 2010, Du Plessis et al. 2009, Thoma et al. 2010).
The use of raw or poorly treated applicant data is equivalent to applying a high precision/low recall disambiguation technique and leads to underestimation of company self-citations and overestimation of knowledge externalities. Internationally, it undervalues the role of multinationals as carriers of knowledge, and overvalues that of inventors' social ties.

Research questions and data
Below, we formulate our research questions and describe our dataset, keeping complexity to a minimum (details in Appendixes 1 and 2).

Research questions: diaspora and brain gain effects
We are interested in exploring how membership in the same foreign-origin community affects the diffusion of technical knowledge, both within the country of destination (CoD) and towards the country of origin (CoO). Emerging naming conventions, as reviewed in section 2, refer to within-community ties as "ethnic" or "co-ethnic", imperfect terms that we nevertheless also adopt (for want of a better alternative).
However, when referring to individual inventors, we opt for "foreign-origin inventors", "inventors from the same country of origin" (both expressions including second- and subsequent-generation migrants) or, where more appropriate, "migrant inventors".
Ethnic ties exist independently of professional experiences and/or physical proximity. They may have been forged in the CoD (reflecting homophilic tendencies; Currarini et al., 2009) or inherited from the home country (as in chain migration). In both cases, they represent an instance of vitality and relevance of a foreign-origin community, to which we will refer as a diaspora.¹

¹ "Diaspora" is also a somewhat imperfect term, used here to conform to current conventions. Dufoix (2008) shows how the term has progressively lost its original meaning in reference to Jewish history (the emphasis being on the absence of a home country) and is now used when speaking of any widely dispersed ethnic community (often in reference to its ties with the home country). In the economics of migration, the term is used even more casually, simply to indicate any stock of migrants (Beine et al., 2011).

We state a diaspora effect to exist when inventors from the same CoO and active in the same CoD have a higher propensity to cite one another's patents than those of other inventors, ceteris paribus, and excluding self-citations at the company level. We test for the effect by adapting the JTH method (see section 2). We consider all cited patents signed by at least one foreign-origin inventor in the US, with citing and control patents having been filed by inventors (both local and of foreign origin) also located in the US. We then estimate the simple model:

$$\Pr(\text{citation} = 1) = f(\text{co-ethnicity},\ \text{spatial distance},\ \text{social distance},\ \text{other controls}) \qquad (1)$$

where the observations are patent pairs and the binary dependent variable takes value one if the two patents in the pair are linked by a citation. The main variable of interest, co-ethnicity, is a dummy variable equal to one when both patents in the pair have been invented by at least one inventor from the same CoO.

Spatial distance is determined on the basis of the inventors' addresses and measured both in terms of co-location and as a continuous variable. Social distance refers to geodesic distances in the network of inventors (Breschi and Lissoni, 2009). When one or both patents in a pair have multiple inventors, we consider minimum social and spatial distances. The other regressors refer primarily to the characteristics of the patents in the pair (in particular, the citing/control patents), based on the considerable body of literature examining the determinants of patent citations (Harhoff et al. 2003, Hall et al. 2005). We provide full details of our sampling scheme and specification in the next two subsections.

We also conduct various robustness checks. These include replacing the JTH citing-control patent methodology with an alternative that uses inventor-added citations as cases and examiner-added ones as controls, based on the assumption that the former, albeit noisy, may be more revealing of direct knowledge exchanges between inventors.
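The minimum social distance between two patents, as defined above, can be illustrated with a short sketch. It builds a co-inventorship network from a list of prior patents and runs a breadth-first search between every pair of inventors on the cited and the citing (control) patent. All function names and inventor names are hypothetical, and the sketch ignores the time window used to build the network; it is not the authors' implementation.

```python
from collections import deque


def build_network(prior_patents):
    """Link every pair of inventors who co-signed at least one prior patent."""
    graph = {}
    for inventors in prior_patents:
        for a in inventors:
            graph.setdefault(a, set()).update(b for b in inventors if b != a)
    return graph


def geodesic(graph, source, target):
    """Shortest path length between two inventors (inf if disconnected)."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return float("inf")


def social_distance(graph, cited_inventors, citing_inventors):
    """Minimum geodesic distance over all cited/citing inventor pairs."""
    return min(
        geodesic(graph, i, j)
        for i in cited_inventors
        for j in citing_inventors
    )


# Invented example: ann-bob and bob-carl co-invented; dina-eve form a
# separate network component.
network = build_network([["ann", "bob"], ["bob", "carl"], ["dina", "eve"]])
print(social_distance(network, ["ann"], ["carl"]))          # two steps via bob
print(social_distance(network, ["ann"], ["eve"]))           # disconnected
print(social_distance(network, ["ann", "dina"], ["eve"]))   # dina-eve link
```

A distance of zero corresponds to a personal self-citation, and an infinite distance to inventors in disconnected components of the network.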
We state a brain gain effect to exist when patents by foreign-resident inventors from a given CoO are disproportionately cited by home-resident inventors (inventors residing in the same CoO). We consider citations mediated by ethnic ties separately from other citation sources, including self-citations of returnee inventors and multinational companies. To do so, we once again adapt the JTH method.
We sample all cited patents signed by foreign-origin inventors in the US, and as citing and control patents we select only those signed by inventors residing outside the US. We retain all patent pairs by the same inventor (most likely a returnee inventor) as well as pairs from the same company or business group, but control for them. We then modify equation (1) and estimate:

$$\Pr(\text{citation} = 1) = f(\text{home country},\ \text{returnee},\ \text{same company},\ \text{spatial distance},\ \text{social distance},\ \text{other controls}) \qquad (2)$$

The dependent variable is the same as in (1), but the main regressor of interest is now home country, a dummy variable that takes value one if at least one inventor of the citing (control) patent resides in the cited inventor's CoO. Returnee and Same company are also dummy variables, which take value one if both patents in the pair have been signed by the same inventor, back in his CoO, or filed by the same company or business group, respectively. Other controls are as in (1), with some adaptations.²

Notice that countries with strong education systems, but limited inventive activity of international standing (such as India, Russia, and China), may have fewer inventors of local origin at home than abroad. This suggests that some intra-ethnic global knowledge flows may exist, similar to trade flows between countries hosting the same ethnic minorities (Rauch and Trindade 2002, Felbermayr et al. 2010). We will refer to this as an "international diaspora" effect and test for it by re-inserting the co-ethnicity dummy in the regression.

² Most notably, spatial distance cannot be measured with co-location dummies, since, by construction, the inventors of cited and citing patents do not reside in the same country. Notice that networks of inventors may span across countries, which justifies including social distance in (2). Personal self-citations may occur (social distance = 0), as when an Indian returnee inventor cites his own prior art, which he filed when abroad. Still, these are rare cases. Even rarer is the case of a migrant inventor who does not return, but moves to a new CoD and cites his own prior art from there.

Patent and inventor data
Our data result from matching the names and surnames of inventors in the EP-INV inventor database (Coffano and Tarasconi, 2014) [...] Cowan and Zinovyeva, 2013; Sterzi, 2013; Nathan, 2015; Akcigit et al., 2016). Appendix 1 succinctly describes both the algorithm and how it was adapted to the needs of the present study.
Using USPTO data, as opposed to EPO data, would appear a more natural choice for a study of US-resident inventors. Indeed, Li et al. (2014) provide disambiguated USPTO inventor data. For our purposes, however, the EP-INV dataset has richer information content, especially on inventors' addresses, which come complete with harmonized street and ZIP code information from the OECD REGPAT database (Maraut et al. 2008). This provides crucial information for our disambiguation algorithm. In addition, mastering our own disambiguation algorithm allows us to calibrate it according to our needs (see Appendix 1).³

Harmonization of applicant names is performed using the OECD HAN Database, as employed in recent PatStat releases. However, as this is far from perfect, we also carried out an ad hoc reconstruction of business groups, using Bureau van Dijk's Zephyr database on Mergers & Acquisitions.⁴

The IBM-GNR system is a commercial product using information collected by the US immigration authorities in the first half of the 1990s. When fed with either a first name or a surname, IBM-GNR returns a list of Countries of Association (CoA) plus statistical information on the strength of the association. Consider for instance the inventor Rajiv Laroia. His first name, Rajiv, is associated with seven countries, including India, the UK and the Netherlands. As far as the cross-country distribution of the name (labelled "significance") is concerned, IBM-GNR suggests that around 80% of individuals named Rajiv originate from India, 10% from the UK, and only around 1% from each of the other countries. In the case of the within-country distribution (labelled "frequency"), Rajiv is deemed very common in India (in the top decile), but not elsewhere (5th decile in the UK, bottom decile in the Netherlands). The surname Laroia is associated with just two countries, India (99% significance) and France (1% significance).⁵ We treat this information using an additional, original algorithm (Ethnic-Inv), as described in Appendix 2.
Briefly, we select one and only one CoO by selecting the CoA most closely associated with the inventor's name and surname. We use three indicators: (a) the frequency of the first name in English- and Spanish-speaking countries (the two most spoken languages in the US); (b) the product of the significance of the first name and the surname, for each CoA; and (c) the stand-alone significance of the surname, for each CoA.
The higher (b) and (c), the more likely it is that a CoA actually corresponds to the inventor's CoO. The opposite holds for (a), since an inventor with, say, a typical Indian surname, but with John or Luis as a first name, is unlikely to be a first-generation immigrant to the US. He may be second-generation, but with no close ties to the diaspora, since his parents did not choose an ethnic name, opting for a distinctly local one.
In the case of Rajiv Laroia, his surname presents high values of (b) and (c) when associated with India, while his first name is a high-frequency one in India, a zero-frequency one in Spanish-speaking countries, and a low-frequency one in English-speaking countries (and only in those that host Indian minorities). We conclude he is either a first-generation Indian migrant or an insider member of the Indian community in the US.
Our algorithm assigns a specific weight to each of our three indicators (a)-(c), which we obtain by calibration against a benchmark dataset on the nationality of inventors resident in the US, based on PCT patent applications. We retain the weights for a "high recall" calibration, that is, one that minimizes false negatives (foreign-origin inventors from the selected CoO mistaken for locals), albeit at the price of low precision. We do so in order to avoid a bias in favour of positive co-ethnicity effects in equations (1) and (2). When no CoO can be selected (no association is sufficiently strong), inventors are treated indifferently as locals or as foreigners from an unknown CoO.
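As a rough illustration of how the three indicators might be combined, the sketch below scores each CoA with a weighted sum and returns the best-scoring country only if it clears a threshold. The linear form, the weights, the threshold, and the value used for indicator (a) are all assumptions made for the example; the actual Ethnic-Inv algorithm and its calibrated weights are described in Appendix 2. Only the significance values for "Rajiv" and "Laroia" come from the text.

```python
def assign_coo(first_name_sig, surname_sig, local_first_name_freq,
               weights=(1.0, 1.0, 1.0), threshold=0.5):
    """Pick at most one country of origin (CoO) among the candidate CoAs.

    first_name_sig / surname_sig: dicts mapping a CoA to its "significance".
    local_first_name_freq: indicator (a), the frequency of the first name in
    English- and Spanish-speaking countries (0 = absent, 1 = very common).
    weights and threshold are hypothetical, not the calibrated values.
    """
    w_a, w_b, w_c = weights
    best_coa, best_score = None, threshold
    for coa in set(first_name_sig) | set(surname_sig):
        b = first_name_sig.get(coa, 0.0) * surname_sig.get(coa, 0.0)  # (b)
        c = surname_sig.get(coa, 0.0)                                 # (c)
        # Indicators (b) and (c) raise the score; indicator (a) lowers it.
        score = w_b * b + w_c * c - w_a * local_first_name_freq
        if score > best_score:
            best_coa, best_score = coa, score
    return best_coa  # None means no association is sufficiently strong


# Significance values from the text; the local frequency of 0.1 is invented.
rajiv = assign_coo(
    first_name_sig={"India": 0.80, "UK": 0.10, "Netherlands": 0.01},
    surname_sig={"India": 0.99, "France": 0.01},
    local_first_name_freq=0.1,
)

# Hypothetical inventor with the same surname but a common local first name.
john = assign_coo(
    first_name_sig={"UK": 0.5, "USA": 0.5},
    surname_sig={"India": 0.99, "France": 0.01},
    local_first_name_freq=0.9,
)
print(rajiv, john)
```

With these illustrative settings the first inventor is assigned to India, while the second is left unassigned. A "high recall" calibration in the sense used above would correspond to a lower threshold or a smaller penalty on indicator (a), so that fewer foreign-origin inventors are mistaken for locals.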
Nationality, however, is not the ideal benchmark, as it tends to be overly restrictive. Migrants can, for example, acquire nationality if they reside long enough in the US, and any child born in the US to foreign parents is a US citizen in accordance with the jus soli principle (the former is the case of inventor Rajiv Laroia, whom we know to be an Indian-born US national). Indeed, some ethnic minorities may be composed largely of destination-country nationals, and yet remain cohesive over several generations. Table 1 shows the percentage of inventors in our database from each of the ten selected CoO (column 1) alongside the analogous percentage of inventors listed as nationals from the same countries in PCT data (column 2). Figures in column (1) are always higher than those in column (2), as expected. Columns (3) and (4) report the results of a z-test on proportions, which indicates that these differences are always significant.

They can also be very large, as for Germany, Italy, Poland, and, above all, Iran. This latter case is highly instructive: since Iranians have very distinctive names and surnames, no large error can be attributed to our algorithm. More likely, the difference is explained by the Iranian inventors of the generation that fled the Islamic revolution in the 1980s, and by their descendants. Both are now US nationals, and yet they form quite a distinct community, one that could play a key role in their home country, should there be a change of regime (Modarres 1998, Modarresi 2001, Mostofi 2003).

Table 1 HERE
In sum, nationality as an indicator of foreign origin is imperfect. Yet, in the absence of a better alternative, we use it to calibrate our algorithm as well as to conduct robustness checks.
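For readers unfamiliar with it, a z-test on two proportions of the kind reported in Table 1 can be sketched as follows. This is a generic pooled two-sample test written with the standard library only; the counts in the example are invented and do not reproduce any figure in Table 1, and the exact variant used for the table is not specified in the text.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_1, n_1, successes_2, n_2):
    """Pooled two-sample z-test for equality of two proportions.

    Returns the z statistic and the two-sided p-value.
    """
    p_1 = successes_1 / n_1
    p_2 = successes_2 / n_2
    pooled = (successes_1 + successes_2) / (n_1 + n_2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_1 - p_2) / std_error
    # Two-sided p-value under the standard normal distribution.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Invented counts: 500 of 10,000 inventors assigned to a CoO by name analysis
# versus 300 of 10,000 inventors holding that country's nationality.
z, p_value = two_proportion_z_test(500, 10_000, 300, 10_000)
print(round(z, 3), p_value)
```

The same test of proportions underlies the original JTH comparison between co-localized cases and co-localized controls.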

Sampling
We select all patent applications from the EP-INV database, with priority years between 1990 and 2010, and for which at least one inventor, resident in the US, reported a CoO among the ten selected. Our initial sample includes 88,522 inventors and 174,160 patents. Of these we retain only those applications receiving at least one forward citation from another EPO patent application (either directly, or indirectly, via an equivalent patent in its family).⁶ In this way, we build a "national" and an "international" sample, which we use to investigate the diaspora and brain gain effects, respectively.

For the national sample, we retain all cited-citing pairs in which the citing patent comprises at least one US-resident among its inventors. We then exclude all self-citations at the applicant level, as well as all self-citations at the inventor level, where the self-citing inventor belongs to one of the 10 CoO selected. For each citing patent, we randomly select a control patent that satisfies the following conditions:

1. it does not cite the cited patent,
2. it has the same priority year and is classified in the same IPC groups as those of the citing patent,⁷
3. it comprises at least one US-resident among its inventors.

⁶ On the use of patent families for citation analysis, see Harhoff et al. (2003). For definitions of patent families, see Martinez (2011).

⁷ As the same patent may be assigned to several IPC groups, our matching criteria require the citing patent and its control to be classified in the same number of IPC groups, and to share them all.
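The three control-selection conditions listed above can be sketched as a simple filter followed by a random draw. The record layout, the field names, and the patent identifiers are hypothetical; representing IPC groups as a set makes the equality test match the requirement that the citing patent and its control share all their IPC groups.

```python
import random


def patent(pid, priority_year, ipc_groups, us_resident_inventor):
    """Build a minimal patent record (hypothetical layout)."""
    return {
        "id": pid,
        "priority_year": priority_year,
        "ipc_groups": frozenset(ipc_groups),
        "us_resident_inventor": us_resident_inventor,
    }


def pick_control(citing, cited_id, candidates, citations, rng):
    """Randomly pick a control patent for one cited-citing pair.

    citations is a set of (citing_id, cited_id) tuples.
    Returns None when no candidate satisfies all conditions.
    """
    eligible = [
        p for p in candidates
        if p["id"] not in (citing["id"], cited_id)
        and (p["id"], cited_id) not in citations            # condition 1
        and p["priority_year"] == citing["priority_year"]   # condition 2
        and p["ipc_groups"] == citing["ipc_groups"]         # condition 2: same groups, all shared
        and p["us_resident_inventor"]                       # condition 3
    ]
    return rng.choice(eligible) if eligible else None


# Invented records: only P2 satisfies all three conditions.
citing = patent("P1", 2001, {"A61K 31", "C07D 401"}, True)
candidates = [
    citing,
    patent("P2", 2001, {"C07D 401", "A61K 31"}, True),
    patent("P3", 2001, {"A61K 31", "C07D 401"}, True),   # also cites C0
    patent("P4", 2001, {"A61K 31"}, True),               # fewer IPC groups
    patent("P5", 2002, {"A61K 31", "C07D 401"}, True),   # other priority year
    patent("P6", 2001, {"A61K 31", "C07D 401"}, False),  # no US resident
]
citations = {("P1", "C0"), ("P3", "C0")}
control = pick_control(citing, "C0", candidates, citations, random.Random(0))
print(control["id"])
```

Passing an explicit random generator keeps the draw reproducible, which is convenient when the sampling has to be repeated for robustness checks.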
In the regression setting, observations are "stacked" and flagged by means of the binary variable Citation (equal to one for cited-citing pairs, zero for cited-control pairs). Our dependent variable is then the probability of Citation=1, which we estimate by means of a Linear Probability Model (LPM; Logit estimates, which provide similar results, are available on request).⁹ As for regressors, for all patent pairs in the two samples, we produce the following variables:

1. Co-ethnicity: =1 if at least one inventor in the cited patent and one inventor in the citing (control) patent are from the same CoO.
2. Social distance S (with S=0,1,2,>3,+∞): =1 if the geodesic distance between the cited patent and the citing (control) patent is equal to S. Formally: S = min (Sij), where Sij = geodesic distance between inventor i (i=1…I) on the cited patent and inventor j (j=1…J) on the citing (control) patent, as calculated over the entire network of inventors, for all inventors on the cited and the citing (control) patents. Notice that for i=j → S=0. If i and j belong to disconnected network components, then S=+∞. For each t we calculate a different network of inventors, based on co-inventorship patterns of all patents with priority years from t-5 to t-1.¹⁰
3. Miles: shortest distance (in miles) between the two patents, based on their inventors' addresses. We take the log of this value with the addition, in some specifications, of a quadratic term.¹¹
4. Characteristics of the citing (control) patent, as suggested by Singh and Marx (2013) [...]
8. Other measures of country proximity, such as border-sharing (Contiguous countries), Former colonial relationship, English as a common official language, and Similarity to English, a language similarity index ranging from 0 to 1, adapted from Miguelez (2016).
9. Same company: =1 if applicants of the cited and the citing (control) patents are the same.
10. Returnee: =1 if the inventor of the cited and the citing (control) patents is the same (notice that this implies Social distance 0 = 1).

⁸ Notice that the cited patent may include, alongside the US-resident inventor(s), one or more foreign residents. This means that in our regressions we have to control for the distance between the inventors of the citing/control patents and both the US- and the foreign-resident inventors.

⁹ LPM is easy to interpret, as its estimated coefficients can be read directly as marginal effects. This is particularly valuable in specifications like ours, which are loaded with interactions. Following Long (1997) and Wooldridge (2003), we consider LPM to be a good approximation of logit and probit models for probabilities between 0.2 and 0.8. In our case, the baseline probability of the citation event is 0.5, by construction. The predicted probabilities are as follows: only 1% lower than 0.2 (with no negative predictions) and less than 4% higher than 0.8 (with less than 1% nonsensical, higher-than-1 predictions). Several of the papers we cite adopt the same strategy.

¹⁰ This amounts to assuming that social ties generated by co-inventorship decay after 5 years, unless renewed by further co-patenting. For more details, see Breschi and Lissoni (2009).

¹¹ For each combination of inventors i and j, we calculate the great-circle distance between the centroids of the respective ZIP codes; we then retain the minimum distance. In case of missing ZIP codes, the centroid of the city was used (or the county, if the city's was missing, too).

¹² Co-inventors of a given patent may be located in different countries. In the international sample no inventor of the citing (control) patent can be located in the US, but nothing impedes two inventors in the cited and citing (control) patents from both being located outside the US and in the same country, which is not necessarily the CoO of the inventor(s) of the cited patent.

Table 3 reports the descriptive statistics for all variables in both samples (details by country available on request). For the brain gain regressions we did not retain the observations relating to Iran and Poland, given the small numbers involved. This reduces the international sample from 1,048,258 to 1,004,950 observations. Notice that a cited patent enters our sample as many times as the number of citations it receives. The same applies to each patent citing more than one cited patent, though this tends to be less frequent. This necessitates correcting for non-independence of errors, which we do by clustering errors by cited patent.

Table 4 reports the results of three specifications of equation (1), without distinguishing by CoO. The first specification reproduces Agrawal et al.'s (2008) basic exercise; the second and third introduce social distance between inventors. Two further specifications (unreported) include further controls, first for patent characteristics, including technology fixed effects, then for spatial distance.
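The logic of a linear probability model with errors clustered by cited patent can be shown in a deliberately stripped-down sketch with a single dummy regressor. With one dummy, the OLS slope equals the difference in citation shares between co-ethnic and other pairs, and the cluster-robust variance has a simple closed form. The data are invented, no finite-sample correction is applied, and the specifications in Table 4 include many more regressors, so this is only an illustration of the estimator, not a replication.

```python
from collections import defaultdict
from math import sqrt


def lpm_single_dummy(citation, regressor, cluster_ids):
    """OLS of a 0/1 outcome on one regressor, with cluster-robust std. error.

    Returns (intercept, slope, clustered standard error of the slope).
    """
    n = len(citation)
    x_bar = sum(regressor) / n
    y_bar = sum(citation) / n
    s_xx = sum((x - x_bar) ** 2 for x in regressor)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(regressor, citation))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    # Sum the score (x - x_bar) * residual within each cluster, then square.
    cluster_scores = defaultdict(float)
    for y, x, g in zip(citation, regressor, cluster_ids):
        residual = y - intercept - slope * x
        cluster_scores[g] += (x - x_bar) * residual
    variance = sum(s ** 2 for s in cluster_scores.values()) / s_xx ** 2
    return intercept, slope, sqrt(variance)


# Invented stacked sample: Citation = 1 for cited-citing pairs, 0 for controls.
citation     = [1, 0, 1, 0, 1, 0, 1, 0]
co_ethnicity = [1, 0, 1, 0, 0, 0, 1, 1]
cited_patent = ["A", "A", "A", "A", "B", "B", "B", "B"]

intercept, slope, clustered_se = lpm_single_dummy(
    citation, co_ethnicity, cited_patent
)
print(intercept, slope, clustered_se)
```

In the toy sample the citation share is 0.75 among co-ethnic pairs and 0.25 among the others, so the slope of 0.5 reads directly as a marginal effect, which is the interpretive advantage of the LPM noted in the text.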

Table 4 HERE
The estimated coefficients in column (1) present the same sign and are of the same order of magnitude as those in Agrawal et al. (2008): co-ethnicity positively affects the probability of observing a citation link between two patents, but its marginal effect is smaller than that of MSA co-location. The interaction term between co-ethnicity and co-location is negative. This suggests that a diaspora effect exists, and it is a substitute for co-location.
When controlling for social distance on the network of inventors (column 2), the estimated coefficients for co-location fall sharply, because social distance negatively affects the probability of citation but is positively correlated with spatial distance, in line with previous findings of Breschi and Lissoni (2009). The marginal effect of co-ethnicity also falls, but less noticeably (the interaction terms remain unaltered). At first sight, this suggests that co-ethnicity is not correlated with social distance as strongly as co-location.
Estimates in column (3), where we interact social distance on the network of inventors and co-ethnicity, qualify this result. Here, the interaction terms are positive and significant for social distances higher than three degrees. This indicates a substitution effect. Social ties based on ethnicity only matter when those based on professional experience (co-inventorship) are too loose. However, social distance on the network of inventors is generally associated with larger marginal effects than co-location or co-ethnicity.
Controlling for the patent's characteristics (claims, backward citations, NPL citations, overlap IPCs 7, and overlap IPCs) does not alter the coefficients of interest greatly, which is indicative of the robustness of the refined JTH sampling scheme we adopted. Adding controls for spatial distance (Same State and ln_miles, also in quadratic form) further alters the estimated coefficient of co-location, but does not change the social distance and co-ethnicity estimates. (For both specifications, results are available upon request.)
In Table 5 we allow the estimated coefficient of co-ethnicity to vary across CoO, first without any interaction with MSA co-location (column 1), then with interaction (column 2). The importance of co-ethnicity for the probability of citation varies across CoO, with its estimated coefficient being clearly positive and significant for Asian countries (albeit unstable across specifications for Japan and Iran), Russia, and Germany (again unstable). Marginal effects appear to be largest for Russia, followed, in descending order, by China, Iran, India, South Korea, Japan, and, at some distance, by Germany. As for the interaction term, this is negative and significant only for China and India, and either positive or negative, but never significant, for all the other CoO. This suggests that, overall, the substitution effects between physical and ethnic proximity are driven mostly by Chinese and Indian inventors. The coefficients of social distance and other controls (unreported) do not differ much from those in Table 4.
As a robustness check, we re-examine our evidence by replicating the case-control methodology, as adapted by Singh and Marx (2013). We first assign to all the cited-citing patent pairs (and the corresponding cited-control pairs) two new dummy variables (applicant and examiner), which take value one or zero according to the origin of the citation. 13 We then interact co-ethnicity with the new dummies in the cited-citing patent pair.
Finally, we test whether the estimated coefficients for the two interaction terms are the same (F-test). If the hypothesis is rejected, and the coefficient for co-ethnicity*applicant is larger than that for co-ethnicity*examiner, we can conclude that ethnicity matters. Columns 1 and 2 in Table 6 report the results of two regressions very similar to those in Tables 4 and 5, respectively, but with the interactions we just described and the F-test results just below each pair of coefficients. For ease of exposition, results for the second regression (column 2) are arranged over five columns and two lines. The sample reduces from 1,043,320 to 1,005,592 observations, due to lack of information on the origin of several citations. Our findings are mostly in line with those we obtained with the JTH methodology. For the general co-ethnicity dummy (column 1) as well as for China, Germany, India, and Japan (column 2), the coefficient for co-ethnicity*applicant is larger than that for co-ethnicity*examiner and we reject the null hypothesis. For France, Italy, and Poland, on the contrary, the hypothesis cannot be rejected, again in line with our previous results.
Contrary to what we expected, we cannot reject the hypothesis for Iran and Russia, but even in these cases the coefficient for co-ethnicity*applicant is larger than that for co-ethnicity*examiner (for Russia, we are close to rejecting the hypothesis at the 90% level). The only odd case is that of Korea, for which the co-ethnicity effect is significantly stronger in the examiner citation case.
13 Differently from Singh and Marx (2013), however, we do not deal with citations reported on documents by one national patent office only (in their case, the USPTO; in ours, the EPO). This would make the applicant vs examiner distinction highly dependent on the specific procedures of that particular office. We consider instead all documents in a patent family. We then define a citation as coming from the applicant if and only if it appears as such on all documents in the family (all examiners throughout patent offices worldwide ignored the cited prior art). We consider a citation to be coming from the examiner if it appears as such on at least one document in the family (at least one examiner took notice of the cited prior art). More details in Appendix 3.
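The family-level rule in footnote 13 can be sketched as below. This is only an illustration under stated assumptions: the function name and the encoding of citation sources as the strings "applicant" and "examiner" are hypothetical, and returning `None` for citations of unknown origin is a choice made here to mirror the observations dropped from the sample.

```python
def citation_origin(sources):
    """Classify a citation from its source labels across a patent family.

    `sources` holds one label per family document on which the citation
    appears: "applicant" or "examiner" (hypothetical encoding).
    """
    if not sources:
        return None  # origin unknown: no information on this citation
    if any(s == "examiner" for s in sources):
        return "examiner"   # at least one examiner took notice of the prior art
    return "applicant"      # applicant-added on every family document


# Hypothetical examples.
only_applicants = citation_origin(["applicant", "applicant"])
mixed = citation_origin(["applicant", "examiner", "applicant"])
```

Under this rule a single examiner-added occurrence anywhere in the family is enough to classify the citation as an examiner citation, which makes the applicant category the stricter of the two.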

Table 6 HERE
Cross-country differences in the estimated diaspora effect may depend either on the demographic composition of ethnic groups (shares of first- vs second- and subsequent-generation migrants) or on their social structure (social cohesiveness). Some of these characteristics depend, in turn, on how well we calibrate our algorithm for each specific CoO. The lower the precision, the more likely we are to mix first- or second-generation migrants with locals with the same ancestry, but no connections (e.g. young Italian PhD students at Yale with Italian-Americans in New Jersey), or with migrants from a different CoO, but a common language (e.g. French vs Quebecois; or Germans vs Austrians and Swiss). In Appendix 2, we compare, among other things, our data with US census data on ancestry. We find measurement errors to be most likely for German inventors, followed at a considerable distance by Italians and, at an even further distance, by French and Polish.
One way to assess the relative weight of substantive factors vs measurement errors is to employ a different definition of foreign-origin inventor. In Table 7 we exploit information on inventors' nationality, which is a more stringent definition (although not necessarily more appropriate, as discussed in section 3). This reduces the sample to around a fifth of its initial size. We then run two sets of regressions: in the first, we maintain co-ethnicity as our explanatory variable of interest; in the second, we replace it with co-nationality. When comparing the estimated coefficients for co-ethnicity and co-nationality across the same specifications (columns 1 and 3, and columns 2 and 4, respectively), we note that, in general, the latter is larger. This suggests that our definition of foreign-origin inventors may suffer from the errors described above. However, our results do not change substantially. Coefficients for Poland remain negative, while those for France and Italy do not become significant (although we observe a change of sign for France). For Russia, the coefficients for both co-ethnicity and co-nationality are positive and significant, and do not differ much.
The last column in Table 7 reports the results of a Wald test of the null hypothesis that the coefficients for co-ethnicity and co-nationality are equal. The results are counter-intuitive, because small (large) differences in the coefficients often correspond to very small (large) standard errors. The hypothesis is rejected at the 95% confidence level only for India and (almost) at the 90% level for China, whose coefficients are in any case large and significant in both regressions, thus confirming that a diaspora effect exists. Overall, this suggests that, with the exception of Russia and (to a lesser extent) Germany, no European country exhibits a diaspora effect, and this is not just a statistical artefact due to measurement error.
Notice also that, for each of the first five technological classes, we have many more observations than in the last two, a disproportion that would not have been observed had we sampled local as opposed to foreign-origin inventors (patents in the EP-INV dataset are quite evenly distributed across the seven classes). This is in line with the over-representation of foreign-born inventors in US high technologies.
Appendix 4 reports the results of additional robustness checks. First, in Tables A4.2 and A4.3, we test whether our results depend exclusively on the most important US high-tech clusters, which attract a disproportionate number of highly skilled migrants (Kerr 2009). Second, in Table A4.4, we consider the possibility of cohort effects, with different generations of migrant inventors having different propensities to share knowledge with members of their communities. In both cases our main results remain unchanged.
Third, in table A4.5 we consider the possibility that the high significance of several coefficients in Tables 4 and 5 depends exclusively on our very large sample size. We apply the bootstrap techniques described by Greene (2008, p.596) and Wooldridge (2003, p.378) to specifications (2) in Table 4 and (1) in Table 5. While standard errors increase, estimated coefficients remain significant for India and China, as well as for Russia (with only one exception).
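The logic of a bootstrap standard error can be illustrated with a toy example. The sketch below is not the procedure from Greene (2008) or Wooldridge (2003) as applied in Table A4.5: it bootstraps a simple sample mean rather than a regression coefficient, and resampling whole clusters (cited patents) is an assumption made here to mirror the clustering of errors by cited patent. The function name, the toy data, and the number of replications are hypothetical.

```python
import random
import statistics


def cluster_bootstrap_se(clusters, statistic, n_reps=500, seed=0):
    """Bootstrap standard error of `statistic`, resampling whole clusters.

    `clusters` maps a cluster id (e.g. a cited patent) to its observations.
    """
    rng = random.Random(seed)
    ids = list(clusters)
    estimates = []
    for _ in range(n_reps):
        drawn = rng.choices(ids, k=len(ids))  # sample clusters with replacement
        sample = [obs for cid in drawn for obs in clusters[cid]]
        estimates.append(statistic(sample))
    return statistics.stdev(estimates)


# Toy data (hypothetical): 1 = the citing pair is co-ethnic, 0 = it is not.
toy = {"P1": [1, 1, 0], "P2": [0, 0], "P3": [1, 0, 0, 0], "P4": [1]}
se = cluster_bootstrap_se(toy, statistics.mean)
```

Because the standard error is computed from the dispersion of re-estimated statistics rather than from an asymptotic formula, it does not shrink mechanically with the number of pair-level observations, which is the concern the robustness check addresses.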

Results: international knowledge flows and the brain gain effect
Coming to the brain gain effect, column (1) in Table 8 reports the results of our baseline regression. Of the three countries with the strongest diaspora effect (China, Russia, and India), only the first two also exhibit a positive and significant coefficient for Home country. As for the other countries, the coefficient is positive and significant only for South Korea and France. This suggests that the diaspora effect does not necessarily translate into brain gain, and vice versa.
We are also interested in assessing how much of the brain gain effect may pass through multinationals, rather than intra-ethnic spillovers. In this respect, white bars in Figure 2 show the percentage of home-resident inventors (of either the citing or the control patents) who work for the company that owns the cited patent, for the ten CoO in our sample. We test for this in column (2) of Table 8. We observe that, when interacting Home country with Same company, the positive effect of Home country for France disappears, while the coefficient of the interaction term is positive and significant. A similar pattern can be detected for other advanced countries, including Italy and Japan, but not for South Korea (where the interaction term is negative) nor, more interestingly, for any BRIC country. This suggests that US-resident foreign-origin inventors from advanced countries transfer knowledge back home mainly through the multinationals they may work for, rather than through personal contacts.
Interestingly, Germany behaves neither like France and the other advanced countries, nor like the BRICs and South Korea. That is, neither its inventors nor its companies seem to have privileged access to knowledge produced by migrant inventors in the US. We explored the possibility that this result might be due to measurement errors, caused by the presence of many German inventors in Swiss companies and/or confusion between German, Swiss, and Austrian inventors when using our algorithm. But this appears not to be the case. 15
We finally explore the role of returnee inventors as a brain gain channel. In all columns of Table 8 we observe a positive and significant coefficient for Returnee. However, returnees are very few in number (0.1% of all observations vs. 3% for Same company), so they are an unlikely channel for massive knowledge flows.
Another important cross-country difference may refer to absorptive capacities. Grey bars in Figure 2 report the percentage of home-resident native inventors (of both citing and control patents), by country of origin.
For the most advanced countries we observe values over 70%, indicating that native inventors are disproportionately more active at home than abroad. The opposite holds for the less advanced ones. This implies that while the former host within their borders the vast majority of potential beneficiaries of co-ethnic knowledge flows from the US, the same does not hold for the latter. Indeed, migrant inventors from less advanced countries are so numerous, and so dispersed around the world, that an 'international diaspora' effect may exist, to the benefit of several countries of destination, instead of or besides the home country. We test for this in column (3) of Table 8. There we replace Home country with Co-ethnicity, which indicates the existence of an ethnic tie irrespective of the inventor's country of residence. Results remain the same for China and Russia, but not for India, whose coefficient is now positive and significant. This suggests that, for this country, an 'international diaspora' may exist, along with no benefits for the home country. 16
In order to explore this finding further, Table 9 reports the results of a regression exercise limited to just the BRIC countries in our sample. We allow for the simultaneous presence, among the regressors, of Home country and Co-ethnicity, plus their interaction. For China and Russia, the coefficients of both variables remain positive and significant, while for India this is so only for Co-ethnicity (the interaction terms are never significant). This is further evidence that, in the case of China and Russia, the brain gain and the international diaspora effect co-exist, while for India no brain gain is detected. Notice that this result is compatible with findings by Agrawal et al. (2011) and Branstetter et al. (2015). 17

Table 9 HERE

Discussion and conclusions
Drawing on patent and inventor data, we have investigated whether ethnic ties help in the diffusion of technical knowledge among foreign-origin inventors active in the same country of destination (diaspora effect) and back to their home country (brain gain effect). Our study has focused on the US as destination country and on five Asian and five European countries of origin, selected from among the main sources of highly skilled migration to the US.
17 For the international sample, we did not perform the robustness check based on the applicant vs examiner approach. In the international sample, in fact, the share of applicant citations drops dramatically, since patent offices outside the US do not impose any duty of candor rule (see Appendix 3 for a detailed discussion). In addition, when the USPTO examines patent applications according to the PCT procedure (which is quite likely in the international sample) it treats all citations inserted by foreign patent offices as if they were from applicants, while in reality most come instead from examiners of such offices (Alcacer et al., 2009). Therefore, the distinction between applicant-added and examiner-added citations becomes too blurred to be useful.
Our empirical exercise has exploited a large, original dataset, based on disambiguated inventor data and the linguistic analysis of names and surnames. We also conducted robustness checks based upon inventors' nationality for a sizeable subsample.
We find evidence of a diaspora effect for all Asian countries in our sample (China, India, South Korea, and, to a lesser extent, Japan and Iran) and for two European countries (Russia and, to a much lesser extent, Germany). However, the marginal effect of co-ethnicity is secondary to that of proximity in physical space (co-location at the city level) and in the social network of inventors. In addition, co-ethnic ties appear to matter mainly for inventors who are distant from one another in the social network, that is, they substitute for professional ties. Substitutability also holds with respect to spatial proximity, especially for Chinese and Indian inventors, as already found (for India) by Agrawal et al. (2008).
In the case of the brain gain effect, ethnic ties do not necessarily imply a knowledge transfer to the home country. Specifically, we find no evidence for one of the main diasporas in the US, namely the Indian one.
This may be attributable more to the absorptive capacities of the country of origin than to the international dimension of the diffusion process under consideration. In fact, for both India and the other BRIC countries in our sample, we find evidence of an international diaspora effect, which presents certain analogies with findings in the trade literature (Felbermayr et al. 2010). In contrast, any brain gain effect for France, Italy, and Japan, seems mediated by multinationals.
Despite imperfections in our name-based method for identifying migrants, our results appear robust enough to rule out major problems of measurement error. Still, while we highlight differences between the migrants' countries of origin, we can only speculate as to their causes. We suspect them to lie in the cohort composition of foreign-origin communities or in their composition by migration channel. In the first case, we refer to the different ratio between first- and subsequent-generation migrants, which is higher for Asian countries of origin (plus Russia) than for European ones. As ethnic ties may be stronger for first-generation migrants, this could explain some of the observed differences in the diaspora effect. As for channels, migration from the BRICs may be occurring more frequently via the higher education system, and that from advanced countries via multinationals (Kerr et al. 2016).

(1) and (2) include controls for: Physical and social distance between inventors; Citing patent's characteristics; Technology F.E.

Foreign-origin inventors in the US: Testing for Diaspora and Brain Gain Effects
This version: 03 March 2021

Appendix 1 -Inventor names' disambiguation
We discuss here a few technical issues concerning name disambiguation, both general and specific to studies based on name and surname analysis, like ours. We also present succinctly one key feature of Massacrator 2.0, the name disambiguation algorithm at the basis of the EP-INV database, which we use for our research (for a detailed description, see Pezzoni et al., 2014).
Name disambiguation algorithms can be roughly classified into two groups: rule-based and Bayesian. Here we deal only with the former (for the latter, see Li et al., 2014). 18 A key element of rule-based name disambiguation algorithms consists in measuring the edit or phonetic distance between similar names/surnames, and setting some thresholds under which different names/surnames are considered the same ("matching"). Further information contained in the patent documents, as well as benchmarking, is then used to validate the matches ("filtering"). Ideally, a good algorithm would minimize both "false negatives" (maximise "recall") and "false positives" (maximise "precision").
Precision and recall rates are measured as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

where: TP (FP) = number of true (false) positives; TN (FN) = number of true (false) negatives.
False negatives occur whenever two inventors, whose names or surnames have been spelled or abbreviated differently on different patents, are treated as different persons. False positives occur when homonyms and quasi-homonyms are treated as the same person. Unfortunately, a trade-off exists between the two objectives, which requires making choices based on the consequences of each type of error for the subsequent analysis.
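The two rates and the trade-off between them can be illustrated with a small computation over pairs of patent records judged to belong to the same person. This is a minimal sketch: the function name, the representation of matches as unordered pairs, and the toy benchmark are all assumptions made for illustration.

```python
def precision_recall(predicted_pairs, true_pairs):
    """Precision and recall of predicted same-person pairs against a benchmark."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    tp = len(predicted & truth)   # pairs correctly matched
    fp = len(predicted - truth)   # homonyms wrongly merged
    fn = len(truth - predicted)   # spelling variants wrongly split
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical record pairs; frozensets make each pair order-independent.
truth = {frozenset(p) for p in [("a1", "a2"), ("a1", "a3"), ("a2", "a3"), ("b1", "b2")]}
strict = {frozenset(("a1", "a2"))}                                    # cautious matcher
loose = truth | {frozenset(("a1", "c1")), frozenset(("b1", "c1"))}    # permissive matcher

strict_scores = precision_recall(strict, truth)
loose_scores = precision_recall(loose, truth)
```

In this toy case the cautious matcher makes no false match but recovers only one true pair in four, while the permissive matcher recovers every true pair at the cost of two false matches.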
The three most important consequences for the analysis of ethnic citations are:
1. High precision/Low recall algorithms lead to underestimating the number of personal self-citations and overestimating that of co-ethnic citations. This is because all variants of the same inventor's name and surname will, most likely, be classified as belonging to the same ethnic group (for example, "Vafaie Mehrnaz" and "Vafaie Mehranz" will both be classified as Iranian, but a low-recall algorithm may end up treating them as different persons, when instead they are one). When considering the two most important countries of origin of migrant inventors in the US, China and India, and before disambiguating inventors, we calculate co-ethnic citation rates of 20.5 and 15.2, respectively, which drop to 18.8 and 13.3 after disambiguation. When applying the JTH methodology, this problem can be magnified by the presence of very prolific inventors, who are responsible for a large number of both cited and citing patents, and thus have the potential to generate a large number of false co-ethnic citations.
2. High precision/Low recall algorithms may also lead to underestimating the number of returnee inventors. If a Russian inventor patents as "Yavid Dmitriy" in the US and as "Yavid Dimitriy" in Russia, he will not be counted as a returnee (but his self-citations will be counted as a knowledge flow mediated by ethnicity). However, we suspect this to be a relatively minor problem, as the numbers of returnee inventors appear too low for their order of magnitude to change with a change in algorithms.
3. When applied to inventor sets from different countries of origin, the same matching rules return different results in terms of pre-filtering precision and recall, due to cross-country differences in the average length of text strings containing names and surnames, and in the relative frequency of common names and surnames. Chinese and Korean names and surnames, for example, are both short (which makes it arduous to tell them apart on the sole basis of edit distances) and heavily concentrated on a few, very common ones (such as Wang or Kim). The opposite holds for Russian surnames.
Three complementary strategies may help to tackle these problems. The first consists in making the best possible use of the contextual information contained in patents (that is, correcting for matching errors at the filtering stage). The second consists in using different algorithms to produce more than one dataset, each with a different combination of precision and recall, and using them to test the robustness of results. The third consists in calibrating the disambiguation algorithm by collecting information on the linguistic specificities of each country of origin, and exploiting them at the matching stage. The information retrieval and computational costs increase when moving from the first to the third strategy. For this reason, Massacrator 2.0 does not follow the third one.
Massacrator 2.0 matches inventors on the basis of edit distances between all tokens comprised in the inventors' name-and-surname text strings, and then filters the matches by exploiting information on both the inventors and their patents. 19 Massacrator 2.0 does not produce a unique dataset, but several, each of which is calibrated against a benchmark dataset in order to return a different combination of precision and recall. For this paper we started from the "balanced" calibration (which returns a precision rate of 88%, and a recall of 68%, when tested against a benchmark of French inventors) and slightly modified it. The modification consists in considering as positive cases (that is, the same person) all matched inventors whose patents are linked by at least one citation, irrespective of other filter criteria. This presumably allows for higher recall, and directly addresses the problem of over-estimation of ethnic citations.
To the extent that this modification induces higher recall at the price of lowering precision, it may lead to over-estimating the phenomenon of returnee inventorship (when the same inventor is first found to be active away from her country of origin, and then back in it). As shown in the paper's descriptive statistics, we find very few cases. Whether true or false positives, they are unlikely to affect our findings.
19 As an example, consider "Dmitriy Yavid", a Russian inventor with a 2-token name-and-surname text string, and his fellow countryman "Sergei Vladimirovich Ivanov", with a 3-token name-and-surname string. As all of their tokens are quite different, the two inventors will not be matched. Instead, "Dmitriy Yavid" and "Dimitriy Victorovich Yavid" will be matched, as, of the former's two tokens, one is identical to a token in the latter's, and the other differs by just one character. The "Dmitriy Yavid" - "Dimitriy Victorovich Yavid" match will then be retained as valid if the two inventors' patents are similar in contents, citation patterns, priority year, location in space, or property regime (same applicant); or if the two inventors have common co-inventors, or co-inventors who worked together. Otherwise it will be discarded as a false match.
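The matching stage illustrated in footnote 19 can be sketched with a token-level edit distance. This is not Massacrator 2.0's actual rule set: the requirement that every token of the shorter name lie within a given edit distance of some token of the longer one, the threshold of one character, and the function names are all assumptions chosen so that the sketch reproduces the footnote's two examples. The filtering stage is not covered.

```python
def edit_distance(s, t):
    """Levenshtein distance computed with the classic dynamic programme."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]


def tokens_match(name_a, name_b, max_dist=1):
    """True if every token of the shorter name is close to a token of the other.

    The matching rule and max_dist=1 are assumptions, not documented settings.
    """
    a, b = name_a.lower().split(), name_b.lower().split()
    if not a or not b:
        return False
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    return all(min(edit_distance(tok, other) for other in long_) <= max_dist
               for tok in short)


same_person = tokens_match("Dmitriy Yavid", "Dimitriy Victorovich Yavid")
different = tokens_match("Dmitriy Yavid", "Sergei Vladimirovich Ivanov")
```

With short and highly concentrated names, such as the Chinese and Korean ones discussed above, a fixed one-character threshold would match many distinct inventors, which is why the filtering stage carries so much weight.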

Appendix 2 -Ethnic classification of inventors
When fed with a name and/or a surname, the IBM-GNR system returns a list of Countries of Association (CoAs) and two main scores: 20
- "frequency", which indicates to which percentile of the frequency distribution of names or surnames the name or surname belongs, for each CoA;
- "significance", which approximates the frequency distribution of the name or surname across all CoAs. 21
The IBM-GNR list of CoAs associated to each inventor is too long to be immediately reduced to a unique country of origin for each inventor in our database. This operation requires filtering a large amount of information through an ad hoc algorithm, one that compares the frequency and significance of the two lists of CoAs associated, respectively, to the inventor's name and surname to the inventor's "country of residence" at the moment of the patent filing (which we obtain from the inventor's address in the EP-INV dataset). Figure A2.1 illustrates the type of information provided by IBM-GNR, the position of our algorithm in the information processing flow, and the final outcome. Notice that we refer to "country of association" (CoA) when considering the raw information from IBM-GNR, and to "country of origin" (CoO) when considering the final association between the inventor and one of the many CoAs proposed by IBM-GNR (or one of our "meta-countries" based on linguistic association). The full description of the algorithm is as follows:
I. We consider only inventors in the EP-INV database with at least one patent filed as US residents, or who cite at least one patent filed by US residents, and we either assign each of them to one of the 10 CoO of our interest, or leave her "unassigned" (which means she may be either a US "native", whatever this might mean, or a migrant from other countries).
II. The 10 CoO of our interest are China, India, Iran, Japan, and South Korea (for Asia) and France, Germany, Italy, Poland, and Russia (for Europe).
They share two characteristics: they belong to the top 20 CoO of highly skilled migrants in the US, according to OECD/DIOC stock figures for 2005/06; and their official language is neither English nor Spanish, which is a prerequisite for our algorithm to make sense when applied to migration into the US. 22
III. For each inventor, we consider three indicators:
a. The frequency of her first name(s) in English- and Spanish-speaking CoAs. 23
b. The product of the significances attached to her name and to her surname, for each CoA coinciding with one of the 10 CoO of our interest. Notice that, in principle, we could find that an inventor is associated to more than one of the 10 CoO of our interest, either via her name or her surname (for example, a French inventor of Italian descent may have a French name and an Italian surname). However, these cases are very few.
20 Information on IBM-GNR reported here comes from IBM online documentation (http://www-01.ibm.com/support/knowledgecenter/SSEV5M/SSEV5M_welcome.html?lang=en; last visit: 19/1/2015) as well as Patman (2010) and Nerenberg and Williams (2012). E-mail and phone exchanges with IBM staff were also decisive in facilitating our understanding. Still, since IBM-GNR is a commercial product partly covered by trade secrets, we did not have full access to its algorithms and we had to reconstruct them by deduction. For an application to a research topic close to ours, see Jeppesen and Lakhani (2010).
21 For example, an extremely common Vietnamese surname such as Nguyen will be associated both to Vietnam and to France, which hosts a significant Vietnamese minority; but in Vietnam it will get a frequency value of 90, while in France it will get only, say, 50, the Vietnamese being just a small percentage of the population. When it comes to significance, the highest percentage of inventors named Nguyen will be found in Vietnam (say 80), followed by France and several Asian countries, with much smaller values.
22 Language is an issue to the extent that our tools cannot distinguish English-speaking migrant inventors from US ones, nor Spanish-speaking migrants from one country of origin or another. This is why we cannot include in our analysis important origin countries such as the UK, Canada, Mexico and Cuba. We also have not yet included Ukraine and Taiwan, as this will require merging them with Russia and China, respectively. Two other countries in the top 20 list we have not included are Vietnam (too few observations among inventors) and Egypt (whose migrants into the US we cannot tell apart from those from other Arabic-speaking countries).
23 The intuition is as follows. An inventor with a typical Indian surname, such as Laroia, but named John or Luis is unlikely to be a recent Indian migrant into the US; this is because John and Luis are high-frequency names, respectively, in English-speaking and Spanish-speaking countries (among which we count the US). More likely, he was born in the US, possibly from mixed parents. On the contrary, Rajiv Laroia is more likely to be a first-generation Indian immigrant, as Rajiv is a high-frequency name in India, a zero-frequency name in Spanish-speaking countries, and a low-frequency name in English-speaking countries that host Indian minorities.
c. The significance attached to the surname in the CoA associated to indicator n.2. 24
As a result, we will have, for each inventor, one (or very few) candidate CoO and three indicators of the potential success of this "candidacy".
IV. We set six possible threshold values for indicator n.1 (from 10 to 100, with steps of 20), eleven threshold values for indicator n.2 (from 0 to 10000, with steps of 1000), and six threshold values for indicator n.3 (from 50 to 100, with steps of 10). We consider 102 combinations of such threshold values ("calibrations"), and for each combination we assign each inventor to one or another CoO (or to no CoO at all). Each inventor is therefore associated to one vector of 102 dummies (one for each calibration) and a specific CoO, with dummy=1 indicating that the inventor comes from that CoO, and dummy=0 that she does not (no CoO assigned). 25
V. We apply steps I. to IV. also to inventors in the WIPO-PCT database, which reports the inventors' nationality; we use the latter as a benchmark to evaluate the precision and recall rates obtained by each calibration, for each CoO. We then identify the Pareto-optimal calibrations, namely the calibrations whose precision rate cannot be improved upon without losing out on the recall rate, and vice versa (blue dots in figure A2.2, which reports the calibration results for China and Italy). Notice that the Pareto-optimal calibrations are not necessarily the same for all CoO; again from figure A2.2, one can see that the distribution of Pareto-optimal calibrations for China is more convex than the one for Italy. In other words, the sharpness of the trade-off between precision and recall differs across CoO: while for Italy we can attain a 70% precision rate only at the cost of reducing the recall rate to 10%, for China we reduce the latter only to 60%. The precision-recall trade-off can be considered a measure of the quality of our algorithm, per country.
In general, quality is higher for Asian countries (with the exception of Iran) than for the European ones.
VI. Finally, we retain for our analysis two calibrations per CoO: a "high recall" calibration (one that ensures the highest recall value, conditional on precision being at least 30%); and a "high precision" calibration, one that requires precision to be no less than 70%. High recall values may include a large number of false positives (inventors wrongly assigned to one or another of the 10 CoO of interest), but they also accommodate a looser definition of migrant inventors, one that includes late-generation migrants. The validity of this looser definition depends on the strength of the ties binding such migrants to other US residents of the same descent and/or to their countries of origin (on which we have no a priori information).
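Steps V and VI can be illustrated with a minimal sketch. It is not the code used to build the Ethnic-Inv database: the names `Calibration`, `precision_recall`, `pareto_front` and `best_recall_given_precision` are hypothetical, the figures are made up, and we assume that, among the calibrations meeting a precision floor, the one with the highest recall is retained (the text states this for the "high recall" case only).

```python
from typing import NamedTuple, Optional


class Calibration(NamedTuple):
    label: str        # hypothetical identifier of a threshold combination
    precision: float  # share of assigned inventors whose benchmark nationality matches
    recall: float     # share of benchmark nationals that the calibration assigns


def precision_recall(assigned: set, benchmark: set) -> tuple:
    """Precision and recall of a set of assigned inventors against a benchmark set."""
    true_pos = len(assigned & benchmark)
    precision = true_pos / len(assigned) if assigned else 0.0
    recall = true_pos / len(benchmark) if benchmark else 0.0
    return precision, recall


def pareto_front(cals: list) -> list:
    """Keep calibrations that no other calibration weakly beats on both rates."""
    front = []
    for c in cals:
        dominated = any(
            o.precision >= c.precision and o.recall >= c.recall
            and (o.precision > c.precision or o.recall > c.recall)
            for o in cals
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.recall)


def best_recall_given_precision(cals: list, min_precision: float) -> Optional[Calibration]:
    """Highest-recall calibration among those meeting the precision floor."""
    eligible = [c for c in cals if c.precision >= min_precision]
    return max(eligible, key=lambda c: c.recall) if eligible else None


# Made-up precision/recall pairs for one CoO, for illustration only.
cals = [
    Calibration("A", 0.25, 0.90), Calibration("B", 0.35, 0.80),
    Calibration("C", 0.50, 0.60), Calibration("D", 0.72, 0.30),
    Calibration("E", 0.45, 0.55), Calibration("F", 0.80, 0.10),
]
front = pareto_front(cals)                                # E is dominated by C
high_recall = best_recall_given_precision(cals, 0.30)     # precision floor of 30%
high_precision = best_recall_given_precision(cals, 0.70)  # precision floor of 70%
```

With these illustrative numbers the two retained calibrations differ, which mirrors the trade-off described in step V: raising the precision floor from 30% to 70% costs recall.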
In the present version of the paper, we make use only of "high recall" calibration results. To further compare data quality across CoO, we inspect the frequency distribution of values taken by indicator n.2 (figure A2.3). The more right (left) skewed the distribution, the better (worse) the quality: the most striking comparison here is between India and Italy, with the former clearly exhibiting higher quality. According to this measure, too, quality is generally higher for Asian countries (with the exception of Iran) than for European ones.
24 The intuition is as follows: indicator n.2 may have a high value due exclusively to a very high value of the significance for the name, with a moderate value for the significance of the surname. We wish the latter not to be too low.
25 Keeping with the example from the previous footnotes, Rajiv Laroia will be associated to CoO=India, with a vector containing n<102 zeroes and 102-n ones. The ones are all associated with "high recall" combinations of high threshold values for indicator n.1 and low threshold values for indicators n.2 and n.3 (such as, respectively, 70-5000-60; see figure 1), while the zeroes will be associated with "high precision" combinations (low threshold values for indicator n.1 and high threshold values for indicators n.2 and n.3; such as, respectively, 30-8000-80). Rajiv Laroia will be confirmed as having CoO=India only in the high recall case, but not in the high precision case (for which indicator n.1 is too high). In practice, the high precision combination leaves the door open to Rajiv Laroia's CoO being the UK, and to Rajiv Laroia being possibly of Indian descent, but with no ties to India or to Indian migrants in the US.
This is confirmed by a comparison between the distribution by CoO of our inventors and the comparable distribution obtained from census data. Table A2.1 reports information drawn from IPUMS-USA data for year 2000 (https://usa.ipums.org/usa/), namely:
- The percentage share of US residents with 4+ years of college education, born outside the US, by country of birth (aged 15 and above)
- The percentage share of US residents (all education levels, aged 15 and above), born in the US but of foreign ancestry, by ancestors' country. 26
The two shares are compared to the shares of inventors of foreign origin in our database, for inventors with at least one patent in year 2000. The same information is displayed in figure A2.4, with ancestry information on the right axis.
26 Ancestry is a piece of information provided by census respondents, which is subsequently recoded but not verified by census officials; respondents with mixed ancestry typically pick one, or rarely two, according to their own sense of identity.
College-educated US residents are the best proxy for inventors we can get from census data, based on the reasonable assumption that most inventors hold a college degree (especially in science-based fields, which we know to be the most affected by immigration). As for the share of US-born residents of foreign ancestry, this is indicative of the presence of many non-English surnames, and possibly names, which may induce the Ethnic-Inv algorithm to classify an inventor as of foreign origin, when in fact he or she may be the descendant of 19th-20th century migrants.
We observe the share of college-educated foreign-born residents to be very similar to that of inventors of foreign origin for Iran, Korea, Poland, Russia, and, to a lesser extent, Japan. We take this as a suggestion that the Ethnic-Inv algorithm does a relatively good job in these cases.
For China and India, the percentage of foreign-origin inventors is much higher than that of college-educated US residents; but we can explain that with the recent migration boom of scientists and engineers, as confirmed by many sources in the literature. At the same time, we observe that the percentage of US residents with Chinese or Indian ancestry is relatively small, which rules out a misclassification of the latter in the Ethnic-Inv database. The opposite holds for Germany, France and Italy, where again the percentage of foreign-origin inventors is much higher than that of foreign-born college-educated residents, but: (1) the literature does not suggest, as it does for China and India, a recent migration wave of scientists and engineers; (2) the percentage of US residents of foreign ancestry is very high, which suggests misclassification in the Ethnic-Inv database.
The problem appears to be particularly severe for Germany, where the difference between college-educated residents and inventors is very large, and the percentage of US residents of German ancestry is very high. We further check the reliability of our data by comparing them both to WIPO-PCT data (which, as said above, provide information on the nationality of inventors) and to estimates by Kerr (2008), who also uses a name-based ethnicity assignment algorithm, based on a different source than IBM-GNR (and for a more limited spectrum of countries of origin).
Columns (3) and (4) report the results of a z-test on proportions, which indicates these differences always to be significant. This is expected, as long-term migrants have the possibility to acquire US nationality over the years (and a cursory look at WIPO-PCT data suggests this to be the case, with some prolific inventors who declare different nationalities in their early vs late patents).
Table notes: (iii) Normalized difference between (1) and (2); (iv) p-values for z-test on H0: (1) = (2)
Still, we observe cross-country variations that may be due either to lack of precision in the Ethnic-Inv algorithm or to differences in the propensity to acquire nationality for each migrant community (this in turn may be due to the average time spent by the migrants in the US, the number of US-born second-generation migrants, and the frequency of mixed marriages). In particular, we notice larger differences, in relative terms, for Germany, Italy, and Poland, where the share of foreign nationals is about double the share of foreign-origin inventors. Still, the differences for both Italy and Germany are much more limited than the ones observed in table A2.1 (comparison with college-educated foreign residents).
With a 3:1 ratio, Iran is a special case, as we know that Iran is not a historical country of origin of US immigrants, nor do Iranian surnames lack distinctiveness. Hence, we conclude that many Iranian inventors may be part, or the immediate descendants, of the migration wave following the 1979 revolution, who later acquired US citizenship (or obtained it at birth by ius soli).
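For readers who wish to reproduce the kind of z-test on proportions used in the comparison with WIPO-PCT data, a minimal sketch follows. It assumes two independent samples and a pooled standard error, which may not match the exact variant behind the reported p-values; the function name `two_proportion_z_test` and the counts in the example are hypothetical.

```python
import math


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple:
    """Two-sided z-test of H0: p1 = p2, using the pooled proportion.

    x1, x2 are counts of "successes" (e.g. inventors attributed to a given CoO
    or nationality); n1, n2 are the corresponding sample sizes.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts: 60 out of 1000 in one sample, 30 out of 1000 in the other.
z, p_value = two_proportion_z_test(60, 1000, 30, 1000)
```

With large samples even modest differences in shares yield very small p-values, which is consistent with the differences being reported as always significant.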
We finally compare our data with those published by Kerr (2008) for a more limited set of countries of origin (China, India, Japan, Korea and Russia) and patents granted by the USPTO. 27 Figure A2.5 reports the share of EPO patent applications by US-resident inventors of foreign origin, over the total of US residents' applications, from 1980 to 2010, for the 10 CoO of interest. The observed trends are very similar, with the only exception of Indian inventors' patents in the 2000s, for which Kerr observes a decline and we do not. As for values, they are of the same order of magnitude, but our data generally exhibit lower shares, especially for Russia (from little more than 0% to around 1%, as opposed to 3% to 4.5% for Kerr), with the exception of India (our share being overall 1 percentage point higher than Kerr's).
27 Kerr considers "ethnic groups", as defined by the Melissa database for ethnic marketing, rather than specific CoO, namely: Chinese, Indian, Japanese, Korean and Russian, which correspond more or less to our CoO; Vietnamese, which we do not consider; and European and Hispanic, which are too broad aggregations of CoO to be of interest to us.
Table A3.2 reports the distribution of citations across origin (applicant vs examiner), by source (patent authority). In the national sample, the citations from patent authorities other than the USPTO are mostly examiner citations, while the opposite is true for the USPTO. This is expected, due to the peculiarity of the US system (duty of candour rule). But when we move to the international sample, the proportion of applicant vs examiner citations from the USPTO is reversed. Thus, non-US applicants (from whom most citations in the international sample come) provide, in relative terms, many fewer citations than US ones, as they do not conform to the duty of candour rule.
Table A3.3 reproduces the results of an OLS regression that replicates on our data the exercise conducted by Lampe (2012).
Lampe (2012) finds that "applicants withhold between 21% and 33% of relevant citations known to them" (p.320). He obtains this result by comparing the citations introduced by examiners on a focal patent to the citations introduced by the focal patent's applicant in its own previous applications. He finds that, very often, applicants cite some prior art in a patent filing at time t (which proves they were aware of it), but do not cite it at time t'>t, when it is the examiner who does so, thus providing evidence that the prior art was relevant.
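The logic of this comparison can be sketched as follows. This is only one simple way to operationalize the idea, not Lampe's actual estimation procedure: the function name `withheld_share` and the patent identifiers are hypothetical, and "known and relevant" is taken here to mean prior art that the applicant cited in earlier filings and that appears on the focal patent from either source.

```python
from typing import Optional


def withheld_share(prior_applicant_cites: set,
                   focal_applicant_cites: set,
                   focal_examiner_cites: set) -> Optional[float]:
    """Share of known, relevant prior art that only the examiner cites on the focal patent."""
    # Known to the applicant (cited in its earlier filings) and relevant
    # to the focal patent (cited there by the applicant or by the examiner).
    known_relevant = prior_applicant_cites & (focal_applicant_cites | focal_examiner_cites)
    if not known_relevant:
        return None  # nothing to compare for this focal patent
    # Withheld: known to the applicant, but added to the focal patent by the examiner only.
    withheld = known_relevant & (focal_examiner_cites - focal_applicant_cites)
    return len(withheld) / len(known_relevant)


# Hypothetical identifiers, for illustration only.
share = withheld_share(
    prior_applicant_cites={"P1", "P2", "P3", "P9"},
    focal_applicant_cites={"P1", "P5"},
    focal_examiner_cites={"P2", "P3", "P6"},
)
```

In this made-up example, three cited patents were already known to the applicant, and two of them reach the focal patent only through the examiner.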

Figure A2.5 -Ethnic inventors' share of EPO patent applications by US residents; by Country of Origin
Thus, in Table A3.3, the observation set pools all citing-cited pairs from both the national and the international samples, but retains only those in which the cited patent was already "known" to the applicant (it appears among the backward citations of the same applicant's prior patents). The dependent variable is binary, equal to 1 if the citation comes from the applicant (at least one applicant citation in the patent family). All regressors are dummies, indicating whether:
- the citation comes from the international sample
- the citing applicant is located outside the US (applicant address is a non-US one)
- the priority patent in the citing family was first applied for at the USPTO, before 2001 (that is, before the USPTO started releasing information on the origin of the citations)
The regression also includes an interaction term between the first two regressors.
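For clarity, the specification just described can be written out as below. The variable names and coefficient labels are ours and purely illustrative (they do not appear in Table A3.3): for a citing family i and a cited patent j, APP equals 1 if the citation comes from the applicant, INT flags the international sample, NONUS flags a non-US citing applicant, and PRE2001 flags a USPTO priority filing before 2001.

```latex
\begin{equation}
\mathit{APP}_{ij} = \beta_0
  + \beta_1\,\mathit{INT}_{ij}
  + \beta_2\,\mathit{NONUS}_{i}
  + \beta_3\,\bigl(\mathit{INT}_{ij} \times \mathit{NONUS}_{i}\bigr)
  + \beta_4\,\mathit{PRE2001}_{i}
  + \varepsilon_{ij}
\end{equation}
```

Under this notation, the coefficient on the interaction term captures whether the effects of the first two dummies reinforce each other.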
We notice that, in line with the descriptive statistics, the citations from the international sample are less likely to originate from the applicant, and the same applies to citations from non-US firms, with the two effects being complements. Summing up, when dealing with the national sample, we can interpret the applicant vs examiner citations as in  and the related, US-centric literature. This is because most citations come from the USPTO, which ensures both a sizeable proportion of applicant citations and no measurement error in the origin attribution (that is, most applicant citations do indeed come from the applicant, although it is unclear whether the inventor or the attorney produced them). But when we move to the international sample, we have little hope of reproducing  results, since the share of applicant citations

Appendix 4 -Regression analysis (Diaspora effect): Further robustness checks
We deal with the disparities in the precision of our Ethnic-Inv algorithm by running some robustness checks.
First, we exploit information on the nationality of inventors, for the subset of inventors who also have patents in the WIPO-PCT database. Based on information on patent families provided by PatStat, we first identified all patents in the WIPO-PCT database that are equivalents of EP-INV patents in our sample. Within each pair of equivalent patents we name-matched inventors on the EPO patent to inventors on the WIPO-PCT one: around 90% of positive matches result from perfect name string matching, the remainder from a combination of Soundex matching of surname and first given name (around 9%), 2-gram string matching or manual checking (less than 125). This allowed us to assign a nationality to all inventors in the EP-INV database with at least one patent in the WIPO-PCT database. We then retain only the cited patents (and the related citing and control ones) in which the inventors' countries of origin and of nationality coincide. This reduces the sample to around one fifth of the initial one (see table A4.1). Notice that the distribution by CoO/Nationality is very similar in the two samples. For results and related comments, see table 6 in the paper.
Fourth, we consider the possibility that the high significance of several coefficients in tables 3 to 5 may depend on the very large number of observations in our sample, which may decrease the variance of the estimators. We run again the regressions in table A4.5 with samples of reduced size, by applying the bootstrap technique described by Greene (2008, p.596) and Wooldridge (2002, p.378). As reported in table A4.5, the coefficients are maintained, but the standard errors increase as the size of the subsamples diminishes. Despite this, significance is always maintained for India and China, as well as for Russia with the exception of the last case (smallest sample). In regressions 4 and 8, with many dummies, not all subsamples lead to convergence, so results are based on a smaller set of replications.
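The intuition behind this last check can be illustrated with a small simulation. The sketch below is a generic subsampling exercise on synthetic data, not the procedure described in the cited textbooks and not a replication of table A4.5: it draws repeated subsamples without replacement and recomputes the slope of a linear probability model with a single dummy regressor, which equals the difference in group means. All names and parameter values are hypothetical.

```python
import random
import statistics


def lpm_slope(sample):
    """OLS slope of a 0/1 outcome y on a single 0/1 regressor x.

    With one dummy regressor the slope equals mean(y | x=1) - mean(y | x=0).
    Returns None if one of the two groups is empty in the sample.
    """
    y1 = [y for x, y in sample if x == 1]
    y0 = [y for x, y in sample if x == 0]
    if not y1 or not y0:
        return None
    return statistics.fmean(y1) - statistics.fmean(y0)


def subsample_slopes(data, fraction, replications, rng):
    """Slopes estimated on repeated random subsamples drawn without replacement."""
    size = max(2, int(len(data) * fraction))
    slopes = []
    for _ in range(replications):
        slope = lpm_slope(rng.sample(data, size))
        if slope is not None:
            slopes.append(slope)
    return slopes


# Synthetic data: the probability of y=1 is 0.30 when x=0 and 0.40 when x=1.
rng = random.Random(0)
data = []
for _ in range(20000):
    x = 1 if rng.random() < 0.5 else 0
    y = 1 if rng.random() < 0.30 + 0.10 * x else 0
    data.append((x, y))

# Mean and standard deviation of the slope across subsamples, by subsample size.
results = {}
for fraction in (0.10, 0.01):
    slopes = subsample_slopes(data, fraction, 200, rng)
    results[fraction] = (statistics.fmean(slopes), statistics.stdev(slopes))
```

In this setting the average slope stays close to the value built into the data, while its dispersion across replications grows as the subsample shrinks, which is the pattern the robustness check looks for.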
Estimates based on the 1% subsample do not include the last column, since none of the subsamples was able to converge.
Clustered standard errors in parentheses; *** p<0.01, ** p<0.05, * p<0.1
Table A4.7 - Probability of citation from outside the US, as a function of "home-country" effect, co-ethnicity or co-nationality (also by Country of Origin) - LPM regression
Column groups: HOME COUNTRY; CO-ETHNICITY; CO-NATIONALITY
§ "Home country" in columns 1; co-ethnicity in columns 2; co-nationality in columns 3
§§ F-statistic not computed by our software package due to near-collinearity of some predictors (in particular, the Technology F.E.)
Clustered robust standard errors in parentheses, *** p<0.01, ** p<0.05, * p<0.1