From survey participants to agents

In the social sciences, truly relational datasets are relatively scarce, while individual-centered survey data are far more common. Pop2net provides methods for generating networks from such survey data by first creating actors from survey participants and then connecting them based on their empirical attributes. This approach is particularly effective with datasets that include quasi-network information—for example, household membership data found in surveys like the German Socio-Economic Panel (SOEP).

The following basic example illustrates how Pop2net can be used to construct networks from survey data.

Let’s begin by creating an artificial example dataset:

[1]:

import pop2net as p2n
from pop2net.data_fakers.soep import soep_faker

df_soep = soep_faker.soep(size=100, seed=1)
df_soep.head(20)

[1]:

	age	gender	work_hours_day	nace2_division	hid	pid
0	20.0	male	0.000000	-2	2201	583
1	44.0	female	0.000000	-2	2201	868
2	32.0	female	0.000000	-2	1033	262
3	98.0	female	0.000000	-2	1033	121
4	95.0	female	12.613089	86	8117	780
5	38.0	male	0.000000	-2	8117	461
6	52.0	male	6.528070	64	8117	484
7	80.0	female	0.000000	-2	6219	808
8	44.0	male	1.889185	88	6219	215
9	81.0	female	0.000000	-2	6219	97
10	4.0	male	0.000000	-2	6219	500
11	48.0	male	0.000000	-2	464	915
12	20.0	female	5.524712	86	464	856
13	65.0	female	0.000000	-2	464	400
14	66.0	male	0.000000	-2	464	444
15	85.0	male	0.000000	-2	464	623
16	32.0	female	0.000000	-2	34	713
17	49.0	female	7.113968	84	34	457
18	42.0	male	0.000000	-2	34	273
19	97.0	female	0.000000	-2	3748	606

Next, we create an environment and a creator, which generates the actors and locations for that environment:

[2]:

env = p2n.Environment()
creator = p2n.Creator(env)

Using the creator, we can draw a sample from the dataset above. While doing so, we can specify a column to serve as the sampling unit. In this example, we use the household identifier “hid” as the sample_level. It is also possible to oversample the data, that is, to generate more sampled actors than there are rows in the original dataset. Here, we set the number of samples to 500.

[3]:

df_sample = creator.draw_sample(
    df=df_soep,  # set the dataset to be sampled from
    n=500,  # set the number of lines to be sampled
    sample_level="hid",  # set the unit of sampling
    sample_weight=None,  # if needed, set a column as the sampling weight
)

df_sample.head(20)

[3]:

	age	gender	work_hours_day	nace2_division	hid	pid	hid_original
0	95.0	female	12.613089	86	1	780	8117
1	38.0	male	0.000000	-2	1	461	8117
2	52.0	male	6.528070	64	1	484	8117
3	36.0	female	0.000000	-2	2	562	6014
4	65.0	female	9.102887	99	2	720	6014
5	38.0	other	6.834517	86	2	795	6014
6	51.0	male	8.000000	69	2	691	6014
7	53.0	female	0.000000	-2	2	756	6014
8	28.0	female	3.963627	84	2	384	6014
9	97.0	female	0.000000	-2	3	606	3748
10	2.0	female	0.000000	-2	3	968	3748
11	32.0	male	0.000000	-2	4	450	1416
12	37.0	female	0.000000	-2	4	680	1416
13	61.0	male	5.049022	64	4	521	1416
14	53.0	male	8.000000	99	5	945	9116
15	5.0	male	0.000000	-2	5	658	9116
16	53.0	male	0.000000	-2	6	924	1674
17	90.0	male	0.000000	-2	6	326	1674
18	67.0	male	0.000000	-2	7	520	6915
19	49.0	female	0.000000	-2	7	850	6915

A close look at the output shows that draw_sample overwrites the column specified as the sample level (here “hid”) with new, unique values. The original values are preserved in a separate column (here “hid_original”). This ensures that the sample level can still be used to connect actors without introducing artifacts when the same household is sampled more than once. If desired, this behavior can be disabled by setting the replace_sample_level_column argument to False.
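
To make the relabeling idea concrete, here is a minimal pure-Python sketch of household-level sampling with replacement. It is not Pop2net's implementation: the function name draw_household_sample, the stopping rule (draw whole households until at least n rows are collected) and the tiny input list are assumptions made for illustration only.

```python
import random


def draw_household_sample(rows, n, level="hid", seed=None):
    """Sample whole households with replacement until at least n rows are drawn.

    Hypothetical helper for illustration; not part of Pop2net.
    """
    rng = random.Random(seed)

    # Group the rows by their household identifier.
    households = {}
    for row in rows:
        households.setdefault(row[level], []).append(row)
    household_ids = list(households)

    sample = []
    new_id = 0
    while len(sample) < n:
        original_id = rng.choice(household_ids)
        new_id += 1  # every draw gets a fresh identifier, even a repeated household
        for row in households[original_id]:
            copy = dict(row)
            copy[level + "_original"] = original_id  # keep the empirical identifier
            copy[level] = new_id  # overwrite with the new, unique identifier
            sample.append(copy)
    return sample


# Five rows in three households (values taken from the example data above).
rows = [
    {"pid": 583, "hid": 2201},
    {"pid": 868, "hid": 2201},
    {"pid": 262, "hid": 1033},
    {"pid": 121, "hid": 1033},
    {"pid": 780, "hid": 8117},
]
sample = draw_household_sample(rows, n=12, seed=1)
```

Because only three households exist, some of them are necessarily drawn several times. Each copy receives its own “hid”, so two copies of the same household are never merged into one location, while “hid_original” still records where the rows came from.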

Now we create the actors from the sampled data:

[4]:

_ = creator.create_actors(df=df_sample)

Now let’s use two types of locations (households and workplaces) to connect the created actors based on their empirical attributes:

[5]:

class Household(p2n.LocationDesigner):
    def split(self, actor):
        """Build one household location for each hid."""
        return actor.hid


class Work(p2n.LocationDesigner):
    n_actors = 10  # set the number of actors per workplace to 10

    def split(self, actor):
        """Build workplaces separated by industry."""
        return actor.nace2_division

    def filter(self, actor):
        """Connect only actors with valid industry codes."""
        return actor.nace2_division > 0

    def weight(self, actor):
        """Weight the connection between the actor and the location
        by the empirical working time."""
        return actor.work_hours_day

[6]:

_ = creator.create_locations(location_designers=[Household, Work])
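
The roles of filter, split, n_actors and weight in the Work designer can be pictured with a small standalone sketch. It does not use Pop2net and does not claim to reproduce its internals: assign_workplaces is a hypothetical helper, the actors are plain SimpleNamespace objects, and cutting each industry group into consecutive chunks of at most n_actors members is an assumption about how a size limit could be applied.

```python
from collections import defaultdict
from types import SimpleNamespace


def assign_workplaces(actors, n_actors=10):
    """Group eligible actors by industry, then cut each group into workplaces.

    Hypothetical helper for illustration; not part of Pop2net.
    """
    # filter: keep only actors with a valid industry code
    eligible = [a for a in actors if a.nace2_division > 0]

    # split: one group per industry code
    groups = defaultdict(list)
    for actor in eligible:
        groups[actor.nace2_division].append(actor)

    # n_actors: cut each group into chunks of at most n_actors members
    workplaces = []
    for division, members in groups.items():
        for start in range(0, len(members), n_actors):
            chunk = members[start:start + n_actors]
            # weight: each membership carries the actor's daily working hours
            workplaces.append(
                {
                    "division": division,
                    "members": [(a.pid, a.work_hours_day) for a in chunk],
                }
            )
    return workplaces


# 50 artificial actors: odd pids work in division 86, even pids have no valid code.
actors = [
    SimpleNamespace(pid=i, nace2_division=86 if i % 2 else -2, work_hours_day=8.0)
    for i in range(50)
]
workplaces = assign_workplaces(actors)
```

In this toy input, the 25 actors with a valid code end up in three workplaces of 10, 10 and 5 members, and the 25 actors with code -2 are not assigned to any workplace. How Pop2net itself handles a remainder that is smaller than n_actors is not covered here.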

[7]:

inspector = p2n.NetworkInspector(env)
inspector.plot_networks(location_color="label")