From survey participants to agents

In the social sciences, truly relational datasets are relatively scarce, while individual-centered survey data are far more common. Pop2net provides methods for generating networks from such survey data by first creating actors from survey participants and then connecting them based on their empirical attributes. This approach is particularly effective with datasets that include quasi-network information, such as the household membership data found in surveys like the German Socio-Economic Panel (SOEP).

The following basic example illustrates how Pop2net can be used to construct networks from survey data.

Let’s begin by creating an artificial example dataset:

[1]:
import pop2net as p2n
from pop2net.data_fakers.soep import soep_faker

df_soep = soep_faker.soep(size=100, seed=1)
df_soep.head(20)
[1]:
age gender work_hours_day nace2_division hid pid
0 20.0 male 0.000000 -2 2201 583
1 44.0 female 0.000000 -2 2201 868
2 32.0 female 0.000000 -2 1033 262
3 98.0 female 0.000000 -2 1033 121
4 95.0 female 12.613089 86 8117 780
5 38.0 male 0.000000 -2 8117 461
6 52.0 male 6.528070 64 8117 484
7 80.0 female 0.000000 -2 6219 808
8 44.0 male 1.889185 88 6219 215
9 81.0 female 0.000000 -2 6219 97
10 4.0 male 0.000000 -2 6219 500
11 48.0 male 0.000000 -2 464 915
12 20.0 female 5.524712 86 464 856
13 65.0 female 0.000000 -2 464 400
14 66.0 male 0.000000 -2 464 444
15 85.0 male 0.000000 -2 464 623
16 32.0 female 0.000000 -2 34 713
17 49.0 female 7.113968 84 34 457
18 42.0 male 0.000000 -2 34 273
19 97.0 female 0.000000 -2 3748 606

Next, we create an environment and a creator that operates on it:

[2]:
env = p2n.Environment()
creator = p2n.Creator(env)

Using the creator, we can sample from the dataset above. While doing so, we can specify a column that serves as the sampling unit. In this example, we use the household identifier “hid” as the sample_level, so that households are drawn as a whole. It is also possible to oversample the data, that is, to generate more sampled actors than there are original rows. Here, we set the number of samples to 500.

[3]:
df_sample = creator.draw_sample(
    df=df_soep,  # set the dataset to be sampled from
    n=500,  # set the number of lines to be sampled
    sample_level="hid",  # set the unit of sampling
    sample_weight=None,  # if needed, set a column as the sampling weight
)

df_sample.head(20)
[3]:
age gender work_hours_day nace2_division hid pid hid_original
0 53.0 male 8.000000 99 1 945 9116
1 5.0 male 0.000000 -2 1 658 9116
2 49.0 male 0.000000 -2 2 721 5054
3 40.0 male 0.000000 -2 2 869 5054
4 95.0 female 12.613089 86 3 780 8117
5 38.0 male 0.000000 -2 3 461 8117
6 52.0 male 6.528070 64 3 484 8117
7 67.0 male 1.644200 20 4 949 4747
8 40.0 male 8.719386 99 4 427 4747
9 28.0 female 0.000000 -2 5 403 8278
10 79.0 male 0.000000 -2 5 604 8278
11 97.0 female 3.453167 87 5 874 8278
12 26.0 female 8.000000 84 5 36 8278
13 67.0 male 1.644200 20 6 949 4747
14 40.0 male 8.719386 99 6 427 4747
15 83.0 female 0.000000 -2 7 751 8023
16 94.0 male 1.653220 99 7 31 8023
17 80.0 male 3.645795 86 7 481 8023
18 4.0 male 0.000000 -2 7 45 8023
19 67.0 male 1.644200 20 8 949 4747

Note that draw_sample overwrites the column specified as the sample level (here “hid”) with new, unique values. The original values are preserved in a separate column (here “hid_original”). In the output above, for example, the original household 4747 was drawn several times and appears under the new identifiers 4, 6, and 8. This ensures that the sample level can still be used to connect actors without introducing artifacts caused by duplicated samples, such as several copies of one household being merged into a single household. If desired, this behavior can be disabled by setting the replace_sample_level_column argument to False.
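
To make the relabeling step concrete, here is a minimal sketch in plain Python. It is not Pop2net's implementation: the function sample_households and its stopping rule (draw whole households with replacement until at least n rows are collected) are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def sample_households(rows, n, level="hid", seed=0):
    """Draw whole households with replacement until at least n rows are collected.

    Hypothetical helper for illustration; not part of Pop2net.
    """
    rng = random.Random(seed)

    # Group the rows by their original household identifier.
    households = defaultdict(list)
    for row in rows:
        households[row[level]].append(row)
    original_ids = sorted(households)

    sample = []
    new_id = 0
    while len(sample) < n:
        original_id = rng.choice(original_ids)
        new_id += 1  # every draw gets a fresh identifier, even a repeated one
        for row in households[original_id]:
            copy = dict(row)
            copy[level + "_original"] = original_id  # keep the original value
            copy[level] = new_id  # overwrite the sample level
            sample.append(copy)
    return sample


rows = [
    {"pid": 583, "hid": 2201},
    {"pid": 868, "hid": 2201},
    {"pid": 262, "hid": 1033},
    {"pid": 121, "hid": 1033},
]
sample = sample_households(rows, n=10)
```

Because every draw receives a fresh “hid”, two copies of the same original household remain two separate households when the column is later used to build locations.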

Now we create the actors from the sampled data:

[4]:
creator.create_actors(df=df_sample)
[4]:
EntityList [500 actors]

Now let’s use two types of locations (households and workplaces) to connect the created actors based on their empirical attributes:

[5]:
class Household(p2n.LocationDesigner):
    def split(self, actor):
        """Build one household location for each hid."""
        return actor.hid


class Work(p2n.LocationDesigner):
    n_actors = 10  # set the number of actors per workplace to 10

    def split(self, actor):
        """Build workplaces separated by industry."""
        return actor.nace2_division

    def filter(self, actor):
        """Connect only actors with valid industry codes."""
        return actor.nace2_division > 0

    def weight(self, actor):
        """Weight the connection between the actor and the location
        by the empirical working time."""
        return actor.work_hours_day
[6]:
creator.create_locations(location_designers=[Household, Work])
[6]:
EntityList [178 locations]
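
The interplay of filter, split, and n_actors can be pictured with the following plain-Python sketch. It is a conceptual illustration only, not Pop2net's internal logic: the helper build_locations is hypothetical, and the assumption that oversized groups are simply cut into consecutive chunks of n_actors may differ from what the library actually does.

```python
from collections import defaultdict


def build_locations(actors, split, keep=lambda actor: True, n_actors=None):
    """Group actors into locations: filter first, then split, then chunk by size.

    Hypothetical helper for illustration; not part of Pop2net.
    """
    groups = defaultdict(list)
    for actor in actors:
        if keep(actor):  # plays the role of LocationDesigner.filter
            groups[split(actor)].append(actor)  # plays the role of split

    locations = []
    for key in sorted(groups):
        members = groups[key]
        size = n_actors or len(members)  # no size limit: one location per group
        for start in range(0, len(members), size):
            locations.append((key, members[start:start + size]))
    return locations


actors = [
    {"pid": 780, "hid": 3, "nace2_division": 86},
    {"pid": 461, "hid": 3, "nace2_division": -2},
    {"pid": 484, "hid": 3, "nace2_division": 64},
    {"pid": 481, "hid": 7, "nace2_division": 86},
]

# One household per hid, without any filter or size limit.
households = build_locations(actors, split=lambda a: a["hid"])

# Workplaces per industry, only for actors with a valid industry code.
workplaces = build_locations(
    actors,
    split=lambda a: a["nace2_division"],
    keep=lambda a: a["nace2_division"] > 0,
    n_actors=10,
)
```

In this toy input, the actor with pid 461 has the industry code -2 and therefore belongs to a household but to no workplace, which mirrors the purpose of the filter method in the Work class above.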

Finally, we plot the resulting network, coloring the locations by their label:

[7]:
inspector = p2n.NetworkInspector(env)
inspector.plot_networks(location_color="label")